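Synopsis

This book has a fresh format that is well suited to teaching. Each unit contains the following parts. Texts: two professional articles drawn from a wide range of sources, varied in style and close to real practice. New Words: the new words that appear in the texts, from which readers can build up the basic vocabulary of the big data field. Phrases: common phrases from the texts. Abbreviations: abbreviations that appear in the texts and that practitioners must master. Notes: explanations of difficult sentences in the texts, analysing their grammatical structure to build the reader's ability to read and understand such sentences. Exercises: both exercises based on the texts and some open-ended exercises. Passage Translation: practice to develop the reader's translation skills. Reference Translation: a translation for readers to compare against in order to improve their own.

The book draws on the authors' nearly 20 years of experience in IT English translation and book writing. It is closely tied to every stage of classroom teaching, supporting lesson preparation, teaching, review and examinations, and it comes with supporting PPT slides, reference answers and other materials.

The book can serve as a professional English textbook for big-data-related majors at universities and colleges. It is also suitable for self-study by practitioners and works equally well as a textbook for training courses.
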
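We are moving from the age of information technology into the age of data technology. China's big data industry has entered a period of rapid growth, and many universities have set up big data programs to train urgently needed professionals. Because the industry is developing so quickly, practitioners must master many new technologies and methods, which places high demands on their professional English. Those who have the relevant skills and are proficient in the professional language tend to win out in competition and become indispensable core talent and leaders in the workplace.

The features and strengths of this book are as follows.

(1) Comprehensive material, covering big data fundamentals, software and development technologies, operating systems, the Python and R programming languages, data structures, databases and data warehouses, cloud storage and data backup, data processing and data cleansing, data mining, Hadoop and Spark, data visualization, and big data security. Much of the content is highly practical and broad in coverage.

(2) A fresh format that is well suited to teaching and closely tied to every stage of classroom teaching, supporting lesson preparation, teaching, review and examinations. Each unit contains the following parts. Texts: two professional articles drawn from a wide range of sources, varied in style and close to real practice. New Words: the new words that appear in the texts, from which readers can build up the basic vocabulary of the big data field. Phrases: common phrases from the texts. Abbreviations: abbreviations that appear in the texts and that practitioners must master. Notes: explanations of difficult sentences in the texts, analysing their grammatical structure to build the reader's ability to read and understand such sentences. Exercises: both exercises based on the texts and some open-ended exercises. Passage Translation: practice to develop the reader's translation skills. Reference Translation: a translation for readers to compare against in order to improve their own.

(3) An appropriate number of exercises, varied in type and graded in difficulty, which makes it easy for teachers to organize their teaching.

(4) Full teaching support, including PPT slides and reference answers.

(5) The authors have nearly 20 years of experience writing IT English books. Their English books include three national "Eleventh Five-Year Plan" textbooks, one national bestseller, and one winner of a second prize for textbooks in East China. That experience has helped to refine and improve the content of this book.

If you have any questions while using this book, you are welcome to contact us by email, and we will be sure to reply. Please include your name and the words "Request for Big Data English reference materials" in the subject line. Our email addresses are zqh3882355@sina.com and zqh3882355@163.com.

Should any part of this book fall short, we hope readers will not hesitate to offer corrections, so that together we can make it an excellent textbook that "meets students' actual needs, reflects the real state of the industry, offers rich and practical knowledge, and is rigorous, open and innovative".
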
Text A


Big Data 


Big data is changing the way people work together within organizations. It is creating a 
culture in which business and IT leaders must join forces to realize the value from all data. 
Insights from big data can enable all employees to make better decisions — deepening 
customer engagement, optimizing operations, preventing threats and fraud, and capitalizing 
on new sources of revenue. 


1. The Big Vs 


1.1 Value


This is indeed the holy grail of big data and what we are all looking for. One has to 
demonstrate value that can be extracted from big or small data in order to justify the 
investments, whether on big data or on traditional analytics, data warehouse or business 
intelligence tools, whatever may be the buzzing nomenclature. There seems to be an 
increasing interest related to the value of big data, as indicated by the number of Google 
searches looking for similar terms over the last two years. 


1.2 Volume 


There is no doubt that the information explosion has redefined the connotation of
volumes. There are several such staggering statistics going around and it has become
increasingly difficult to keep track of the number and magnitude of the prefixes attached to
“bytes” while measuring the volume. Since there is a “helluva lot of data”, the term
“Hellabyte” has been coined beyond Petabytes, Exabytes, Zettabytes and Yottabytes.
However, since these measures will be superseded by the likes of Brontobytes, Geopbytes and
more, let's move on!


1.3 Velocity 


Similarly, velocity refers to the speed at which the data is generated. Some of the factors 
that exacerbate this trend are the proliferation of social media and the explosion of IoT 
(Internet of Things). In the context of business operations that have not yet been touched by 
social media or IoT, the velocity arises from sophisticated enterprise applications that capture 
each and every minute detail involved in the completion of a particular business process. 
Enterprise applications have traditionally captured such information but the world has woken 
up to the power of such information largely in the big data era. 


1.4 Variety 


The last of the original attributes of big data is variety. Since we are living in an 
increasingly digital world where technology has invaded into our glasses and watches, the 
variety of data that is generated is mind-boggling. The computing power available is able to 
process unstructured text, images, audio, video and data from sensors in the IoT (Internet of 
Things) world that capture (almost) everything around us. This attribute of big data is more 
relevant today than it ever was. 


1.5 Veracity or Validity 


Veracity or validity of data is extremely important and fundamental to the extraction of 
value from the underlying data. Veracity implies that the data is verifiable and truthful. If this 
condition is violated, the results can be catastrophic. More importantly, there are several cases 
in which the data is accurate but may not be valid in the particular context. For instance, if we 
are trying to ascertain the volume of searches on Google related to big data, we will also 
obtain results pertaining to the hit single “Dangerous” by the band Big Data.


1.6 Visible 


Information silos have always existed within enterprises and have been one of the major 
roadblocks in the attempt to extract value from data. Relevant information should not only 
exist, but also be visible to the right person at the right time. Actionable data needs to be 
visible transcending the boundaries of functions, departments and even organizations for 
value unlocking. Individuals might have believed that information in their hands is power but 
in the age of big data, collective information available to the world at large is truly
omnipotent!


1.7 Visual


We live in an increasingly visual world and the statistics on the increase in the number of
images and videos shared on the Internet are staggering. According to official statistics, 300
hours of video are uploaded every minute on YouTube. In a business context, appropriate 
visualization of data is critical for the management to be able to extract value from their 
limited time, resources and even more limited attention span! 


2. More Contenders 


In addition to the 7 V's described above, there are several other V's that may be 
considered: 


2.1 Volatility


With more applications such as SnapChat and IoT sensors, we may have data in and out 
in a snap. Volatility of the underlying data sources may become one of the defining attributes 
in the future. 


2.2 Variability 


One of the cornerstones of traditional statistics is standard deviation and variability.
Whether or not it makes it onto an extended list of V's relating to big data, it can never be ignored.
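
As a small illustration of standard deviation as a measure of variability, the sketch below compares two series with the same mean but very different spread. The order figures and the helper name `summarize` are invented for this example and do not come from the text.

```python
import statistics

# Hypothetical daily order counts: same mean, very different variability.
steady_orders = [98, 100, 102, 99, 101]
volatile_orders = [40, 160, 100, 20, 180]


def summarize(values):
    """Return (mean, population standard deviation) for a list of numbers."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)  # population standard deviation
    return mean, spread


for name, series in [("steady", steady_orders), ("volatile", volatile_orders)]:
    mean, spread = summarize(series)
    print(f"{name}: mean={mean}, standard deviation={spread:.2f}")
```

Both series average 100 orders a day, so the mean alone cannot tell them apart; the standard deviation is what exposes the difference.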


2.3 Viability


Embedded in the concept of value is the need to check the viability of any project. Big 
data projects can scale up to gigantic proportions and guzzle a lot of resources very quickly. 
Those who do not learn this fast and get fascinated with fads will funnel funds towards futility 
resulting in failure. In a nutshell, viability of any project needs to be established and big data 
projects do not have the liberty of exemption, whether or not it remains a trending buzzword. 


2.4 Vitality


Vitality or criticality of the data is another concept that is crucial and is embedded in the 
concept of Value. Information that is more meaningful or critical to the underlying business 
objective needs to be prioritized. Analysis paralysis needs to be replaced with a more 
pragmatic approach. Technology allows marketers to create segments of one, but is such
extreme segmentation vital or even aligned to the organizational strategy?


2.5 Vincularity


Derived from Latin, it implies connectivity or linkage. This concept is very relevant in 
today's connected world. There is significant value arbitrage potential by connecting diverse 
information sets. For instance, the government has forever been trying to connect the details
of major expenditure heads and correlate them with the income declared in tax returns
to identify concealment of income. The same purpose may now be achieved by drawing 
information from social media posts. 


3. An Example of Big Data 


An example of big data might be petabytes (1,024 terabytes) or exabytes (1,024 
petabytes) of data consisting of billions to trillions of records of millions of people — all from 
different sources (e.g. Web, sales, customer contact center, social media, mobile data and so 
on). The data is typically loosely structured data that is often incomplete and inaccessible. 
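
The unit arithmetic in this example (1,024 terabytes to a petabyte, 1,024 petabytes to an exabyte) can be checked with a short sketch. It assumes the binary convention used in the text, where each unit is 1,024 times the previous one; the names `UNITS`, `to_bytes` and `humanize` are made up for illustration.

```python
# Binary multiples as used in the text: each unit is 1,024 of the previous one.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def to_bytes(value, unit):
    """Convert a quantity such as (1, "PB") into a number of bytes."""
    return value * 1024 ** UNITS.index(unit)


def humanize(n_bytes):
    """Express a byte count in the largest unit that keeps the value below 1,024."""
    value = float(n_bytes)
    for unit in UNITS:
        if value < 1024 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1024


print(to_bytes(1, "PB") // to_bytes(1, "TB"))  # terabytes in one petabyte
print(humanize(to_bytes(1024, "PB")))          # 1,024 petabytes in a larger unit
```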


New Words

realize [ˈrɪəlaɪz] vt. to recognize, understand; to achieve, carry out
engagement [ɪnˈɡeɪdʒmənt] n. degree of involvement, commitment
fraud [frɔːd] n. deception, fraudulent conduct
indeed [ɪnˈdiːd] adv. truly, really; certainly
demonstrate [ˈdemənstreɪt] vt. to show, prove, argue
nomenclature [nəˈmeŋklətʃə] n. system of naming; naming; terminology; technical name
analytics [ˌænəˈlɪtɪks] n. the discipline of analysis
redefine [ˌriːdɪˈfaɪn] v. to define anew
connotation [ˌkɒnəˈteɪʃn] n. implied meaning
staggering [ˈstæɡərɪŋ] adj. astonishing, hard to believe
helluva [ˈheləvə] adj. very large
Hellabyte [ˈheləbaɪt] n. an informal unit of data, coined for sizes beyond the yottabyte
Exabyte [ˈeksəbaɪt] n. unit of data, abbreviated EB
Zettabyte [ˈzetəbaɪt] n. unit of data, abbreviated ZB
Yottabyte [ˈjɒtəbaɪt] n. unit of data, abbreviated YB
Brontobyte [ˈbrɒntəbaɪt] n. unit of data, abbreviated BB
Geopbyte [ˈdʒiːəpbaɪt] n. unit of data, abbreviated GB
velocity [vəˈlɒsəti] n. high speed; speed, rate
exacerbate [ɪɡˈzæsəbeɪt] vt. to worsen, intensify
trend [trend] n. tendency, direction
proliferation [prəˌlɪfəˈreɪʃn] n. multiplication; spread
explosion [ɪkˈspləʊʒn] n. outburst, explosion
era [ˈɪərə] n. age, epoch, period
variety [vəˈraɪəti] n. diversity; kind, type
attribute [ˈætrɪbjuːt] n. property, quality, characteristic
mind-boggling [ˈmaɪnd bɒɡlɪŋ] adj. unbelievable
unstructured [ʌnˈstrʌktʃəd] adj. not structured, unorganized
veracity [vəˈræsəti] n. truthfulness
validity [vəˈlɪdəti] n. validity, legitimacy, correctness
extremely [ɪkˈstriːmli] adv. to an extreme degree, very
fundamental [ˌfʌndəˈmentl] adj. basic, essential; n. basic principle
verifiable [ˈverɪfaɪəbl] adj. able to be verified
truthful [ˈtruːθfl] adj. honest, telling the truth
violate [ˈvaɪəleɪt] vt. to break, offend against, disturb
catastrophic [ˌkætəˈstrɒfɪk] adj. tragic, disastrous
visible [ˈvɪzəbl] adj. able to be seen, obvious, notable; n. something that can be seen
transcend [trænˈsend] vt. to go beyond, surpass
boundary [ˈbaʊndri] n. border, dividing line
omnipotent [ɒmˈnɪpətənt] adj. all-powerful
visualization [ˌvɪʒuəlaɪˈzeɪʃn] n. visualization
span [spæn] n. span, extent, range
contender [kənˈtendə] n. competitor
volatility [ˌvɒləˈtɪləti] n. volatility, fluctuation
variability [ˌveəriəˈbɪləti] n. variability, changeability
cornerstone [ˈkɔːnəstəʊn] n. foundation stone; basis; most important part
viability [ˌvaɪəˈbɪləti] n. feasibility, practicability; ability to survive
gigantic [dʒaɪˈɡæntɪk] adj. huge, enormous
proportion [prəˈpɔːʃn] n. ratio; balance; part; vt. to make proportionate; to balance, apportion
guzzle [ˈɡʌzl] v. to drink or eat greedily; to consume
fascinate [ˈfæsɪneɪt] vt. to charm, captivate; vi. to be captivating
fad [fæd] n. fashion, passing craze, passing interest
funnel [ˈfʌnl] vt. & vi. to pour through a funnel; to form into a funnel shape; to channel; n. a funnel; a funnel-shaped object
futility [fjuːˈtɪləti] n. uselessness, pointlessness
nutshell [ˈnʌtʃel] n. the shell of a nut
exemption [ɪɡˈzempʃn] n. release, exemption
vitality [vaɪˈtæləti] n. timeliness; dynamism, liveliness
criticality [ˌkrɪtɪˈkæləti] n. critical point; critical state; degree of urgency or danger
prioritize [praɪˈɒrətaɪz] v. to give priority to
pragmatic [præɡˈmætɪk] adj. practical, concerned with results
arbitrage [ˈɑːbɪtrɑːʒ] n. arbitrage, arbitrage trading
correlate [ˈkɒrəleɪt] v. to relate to one another
incomplete [ˌɪnkəmˈpliːt] adj. not complete, imperfect


Phrases

big data: extremely large and varied collections of data
capitalize on: to take full advantage of; to turn into capital
holy grail: the Holy Grail; something striven for but very hard to obtain
extracted ... from: drawn out of, taken from
data warehouse: a central store of data kept for reporting and analysis
business intelligence tool: a software tool for analysing business data
information explosion: the rapid growth in the amount of information or knowledge
be superseded by ...: to be replaced by ...
wake up: to become active; to take notice; to (make someone) realize
invade into: to intrude into
unstructured text: text with no predefined structure
underlying data: source data; basic data
pertain to: to belong to, relate to, be attached to
in the attempt to: in trying to
at large: in general; as a whole
according to: as stated by, in line with
in a snap: at once, immediately
standard deviation: a statistical measure of how widely values are spread
scale up: to increase in proportion
get fascinated with: to become captivated by
in a nutshell: in short, to sum up
analysis paralysis: inability to act because of excessive analysis
be replaced with: to have something else substituted
be aligned to: to be consistent with
be derived from: to come from, originate in
draw from ...: to extract from ...
consist of: to be made up of
customer contact center: a center that handles customer communications and service


Abbreviations

IT (Information Technology)
IoT (Internet of Things)

Notes


[1] One has to demonstrate value that can be extracted from big or small data in order to
justify the investments, whether on big data or on traditional analytics, data warehouse or
business intelligence tools, whatever may be the buzzing nomenclature.

In this sentence, "that can be extracted from big or small data" is an attributive clause modifying "value". "in order to justify the investments, whether on big data or on traditional analytics, data warehouse or business intelligence tools" is an adverbial of purpose modifying the predicate of the main clause, "demonstrate". "whatever may be the buzzing nomenclature" is an adverbial clause of concession.

[2] In the context of business operations that have not yet been touched by social media or IoT,
the velocity arises from sophisticated enterprise applications that capture each and every
minute detail involved in the completion of a particular business process.

In this sentence, "that have not yet been touched by social media or IoT" is an attributive clause modifying "business operations". "that capture each and every minute detail involved in the completion of a particular business process" is also an attributive clause, modifying "enterprise applications". Within that clause, "involved in the completion of a particular business process" is a past participle phrase used as a postpositive attributive, modifying "each and every minute detail".

[3] Since we are living in an increasingly digital world where technology has invaded into our
glasses and watches, the variety of data that is generated is mind-boggling.

In this sentence, "Since we are living in an increasingly digital world where technology has invaded into our glasses and watches" is an adverbial clause of reason modifying the predicate of the main clause, "is mind-boggling". Within that clause, "where technology has invaded into our glasses and watches" is an attributive clause modifying "digital world". "that is generated" is an attributive clause modifying "the variety of data".

[4] Embedded in the concept of value is the need to check the viability of any project.

This is an inverted sentence in which the predicative is placed first. "the need to check the viability of any project" is the subject and "Embedded in the concept of value" is the predicative. The normal word order would be: The need to check the viability of any project is embedded in the concept of value.


Exercises


【Ex. 1】 Answer the following questions according to the text.

1. What can insights from big data do? 

2. What does velocity refer to? What are some of the factors that exacerbate this trend? 

3. Why is the variety of data that is generated mind-boggling?

4. What does veracity imply? 

5. What have always existed within enterprises and have been one of the major roadblocks in 
the attempt to extract value from data? 

6. What should relevant information be? 

7. How many hours of video are uploaded every minute on YouTube according to official 
statistics? 

8. What is one of the cornerstones of traditional statistics? 

9. What kind of information needs to be prioritized? 

10. Where is the word vincularity derived from? What does it imply? 


【Ex. 2】 Translate the following sentences into Chinese.

1. I hope that this talk has given you some insight into the kind of the work that we've been 
doing. 

2. The new systems have been optimized for running Microsoft Windows. 

3. These designs demonstrate her unerring eye for colour and detail. 

4. Let me make this clear: A bar chart is not analytics. 

5. A good dictionary will give us the connotation of a word as well as its denotation. 

6. The latest lifestyle trend is downshifting. 

7. The end of an era presupposes the start of another. 

8. You cannot combine structured and unstructured exception handling in the same function.

9. Finally, the practical application shows the feasibility and veracity of this approach. 

10. The viability of multilayer switches depends on the protocol supported. 


【Ex. 3】 Translate the following passage into Chinese.

Cloud computing is a general term for anything that involves delivering hosted services 
over the Internet. These services are broadly divided into three categories: Infrastructure- 
as-a-Service (IaaS), Platform-as-a-Service (PaaS) and Software-as-a-Service (SaaS). The 
name cloud computing was inspired by the cloud symbol that’s often used to represent the 
Internet in flowcharts and diagrams. 

A cloud service has three distinct characteristics that differentiate it from traditional
hosting. It is sold on demand, typically by the minute or the hour; it is elastic — a user can 
have as much or as little of a service as they want at any given time; and the service is fully 
managed by the provider (the consumer needs nothing but a personal computer and Internet 
access). Significant innovations in virtualization and distributed computing, as well as 
improved access to high-speed Internet and a weak economy, have accelerated interest in 
cloud computing. 


【Ex. 4】 Fill in the blanks with the words given below (use each word only once).


update      media      reduced    meaningful   challenge
multiple    data       recent     messages     explosion


Volume


We currently see exponential growth in data storage, as data is now more
than text. We can find data in the form of videos, music and large images on our
social ___(1)___ channels. It is very common for enterprises to have terabytes and petabytes of
storage. As the database grows, the applications and
architecture built to support the ___(2)___ need to be reevaluated quite often. Sometimes
the same data is reevaluated from multiple angles, and even though the original data is the
same, the newly found intelligence creates an ___(3)___ of the data. The big volume indeed
represents Big Data.


Velocity


The data growth and social media explosion have changed how we look at data.
There was a time when we used to believe that yesterday's data was ___(4)___. As a matter of
fact, newspapers still follow that logic. However, news channels and radio have changed
how fast we receive the news. Today, people rely on social media to ___(5)___ them with the
latest happenings. On social media, a message that is only a few seconds old (a tweet, a status
update, etc.) may no longer interest users. They often discard old ___(6)___ and pay
attention to recent updates. Data movement is now almost real time and the update
window has ___(7)___ to fractions of a second. This high-velocity data represents Big Data.


Variety


Data can be stored in ___(8)___ formats, for example in a database, Excel, CSV or Access;
as a matter of fact, it can even be stored in a simple text file. Sometimes the data is not even
in the traditional format we assume; it may be in the form of video, SMS, PDF or something
we might not have thought about. It is the need of the organization to arrange it and make
it ___(9)___. It would be easy to do so if we had data in the same format; however, that is not the
case most of the time. The real world has data in many different formats, and that is
the ___(10)___ we need to overcome with Big Data. This variety of data represents Big
Data.


Text B 


Big Data Analytics 


Big data analytics is the process of collecting, organizing and analyzing large sets of data 
(called big data) to discover patterns and other useful information. Big data analytics can help 
organizations to better understand the information contained within the data and will also help 
identify the data that is most important to the business and future business decisions. Analysts 
working with big data basically want the knowledge that comes from analyzing the data. 




1. Big Data Requires High Performance Analytics 


To analyze such a large volume of data, big data analytics is typically performed using 
specialized software tools and applications for predictive analytics, data mining, text mining, 
forecasting and data optimization. Collectively these processes are separate but highly 
integrated functions of high-performance analytics. Using big data tools and software enables 
an organization to process extremely large volumes of data that a business has collected to 
determine which data is relevant and can be analyzed to drive better business decisions in the 
future. 


2. The Challenges of Big Data Analytics


For most organizations, big data analysis is a challenge. Consider the sheer volume of 
data and the different formats of the data (both structured and unstructured data) that is 
collected across the entire organization and the many different ways different types of data 
can be combined, contrasted and analyzed to find patterns and other useful business 
information. 

The first challenge is in breaking down data silos to access all data an organization stores 
in different places and often in different systems. A second big data challenge is in creating 
platforms that can pull in unstructured data as easily as structured data. This massive volume 
of data is typically so large that it's difficult to process using traditional database and software 
methods. 


3. How Big Data Analytics is Used Today 


As the technology that helps an organization to break down data silos and analyze data 
improves, business can be transformed in all sorts of ways. According to Datamation, today's 
advances in analyzing big data allow researchers to decode human DNA in minutes, predict 
where terrorists plan to attack, determine which gene is most likely to be responsible for
certain diseases and, of course, which ads you are most likely to respond to on Facebook. 

Another example comes from one of the biggest mobile carriers in the world. France's 
Orange launched its Data for Development project by releasing subscriber data for customers 
in the Ivory Coast. The 2.5 billion records, which were made anonymous, included details on 
calls and text messages exchanged between 5 million users. Researchers accessed the data and 
sent Orange proposals for how the data could serve as the foundation for development 
projects to improve public health and safety. Proposed projects included one that showed how 
to improve public safety by tracking cell phone data to map where people went after 
emergencies; another showed how to use cellular data for disease containment. 


4. The Benefits of Big Data Analytics 


Enterprises are increasingly looking to find actionable insights into their data. Many big 
data projects originate from the need to answer specific business questions. With the right big 
data analytics platforms in place, an enterprise can boost sales, increase efficiency, and 
improve operations, customer service and risk management. 

Webopedia's parent company, QuinStreet, surveyed 540 enterprise decision-makers
involved in big data purchases to learn in which business areas companies plan to use big data
analytics to improve operations. About half of all respondents said they were applying big
data analytics to improve customer retention, help with product development and gain a
competitive advantage.

Notably, the business area getting the most attention relates to increasing efficiency and 
optimizing operations. Specifically, 62 percent of respondents said that they use big data 
analytics to improve speed and reduce complexity. 


5. Top 10 Hot Big Data Technologies 


As the big data analytics market rapidly expands to include mainstream customers, which 
technologies are most in demand and promise the most growth potential? The answers can be 
found in TechRadar: Big Data, Q1 2016, a new Forrester Research report evaluating the 
maturity and trajectory of 22 technologies across the entire data life cycle. The winners all 
contribute to real-time, predictive, and integrated insights, what big data customers want now. 

Here is my take on the 10 hottest big data technologies based on Forrester's analysis:


5.1 Predictive analytics 


Software and/or hardware solutions that allow firms to discover, evaluate, optimize, and 
deploy predictive models by analyzing big data sources to improve business performance or 
mitigate risk. 


5.2 NoSQL databases


Key-value, document, and graph databases.

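
To make the three NoSQL data models a little more concrete, here is a rough sketch that imitates each one with ordinary Python dictionaries. It is not how any real database product is implemented; the sample records and the helper `reachable` are invented for illustration.

```python
# Key-value model: an opaque value looked up by a single key.
kv_store = {"session:42": "user=alice;cart=3"}

# Document model: a nested, self-describing record stored under an ID.
doc_store = {
    "order-1001": {"customer": "alice", "items": [{"sku": "A1", "qty": 2}]},
}

# Graph model: nodes connected by edges, here as an adjacency list.
follows = {"alice": ["bob"], "bob": ["carol"], "carol": []}


def reachable(graph, start):
    """Return every node that can be reached from `start` by following edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph[node])
    return seen


print(kv_store["session:42"])
print(doc_store["order-1001"]["items"][0]["qty"])
print(sorted(reachable(follows, "alice")))
```

The key-value lookup returns the value as one piece, the document lookup can reach inside the record, and the graph query follows relationships between records.
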
5.3 Search and knowledge discovery 


Tools and technologies to support self-service extraction of information and new insights 
from large repositories of unstructured and structured data that resides in multiple sources 
such as file systems, databases, streams, APIs, and other platforms and applications. 


5.4 Stream analytics 


Software that can filter, aggregate, enrich, and analyze a high throughput of data from 
multiple disparate live data sources and in any data format. 
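
The filter, enrich and aggregate steps named above can be sketched as a chain of Python generators. This is only a toy pipeline over an in-memory list, not a real streaming engine; the sensor events, the region lookup table and the function names are assumptions made for the example.

```python
from collections import Counter


def filter_valid(events):
    """Filter: drop events that carry no reading."""
    for event in events:
        if event.get("value") is not None:
            yield event


def enrich(events, regions):
    """Enrich: attach a region to each event using a lookup table."""
    for event in events:
        yield {**event, "region": regions.get(event["sensor"], "unknown")}


def aggregate(events):
    """Aggregate: total the readings per region."""
    totals = Counter()
    for event in events:
        totals[event["region"]] += event["value"]
    return dict(totals)


# Hypothetical sensor events; in practice these would arrive continuously.
events = [
    {"sensor": "s1", "value": 3},
    {"sensor": "s2", "value": None},
    {"sensor": "s3", "value": 4},
    {"sensor": "s1", "value": 5},
]
regions = {"s1": "north", "s2": "south"}

totals = aggregate(enrich(filter_valid(iter(events)), regions))
print(totals)
```

Because each stage is a generator, events pass through one at a time rather than being loaded all at once, which is the property that lets the same pattern scale to live data sources.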


5.5 In-memory data fabric 


Provides low-latency access and processing of large quantities of data by distributing 
data across the dynamic random access memory (DRAM), Flash, or SSD of a distributed 
computer system. 


5.6 Distributed file stores


A computer network where data is stored on more than one node, often in a replicated
fashion, for redundancy and performance.


5.7 Data virtualization


A technology that delivers information from various data sources, including big data 
sources such as Hadoop and distributed data stores in real-time and near-real time. 


5.8 Data integration 


Tools for data orchestration across solutions such as Amazon Elastic MapReduce (EMR), 
Apache Hive, Apache Pig, Apache Spark, MapReduce, Couchbase, Hadoop, and MongoDB. 


5.9 Data preparation 


Software that eases the burden of sourcing, shaping, cleansing, and sharing diverse and 
messy data sets to accelerate data's usefulness for analytics. 


5.10 Data quality 


Products that conduct data cleansing and enrichment on large, high-velocity data sets, 
using parallel operations on distributed data stores and databases. 


New Words

analytic [ˌænəˈlɪtɪk] adj. analytical
predictive [prɪˈdɪktɪv] adj. serving to predict, foretelling
forecasting [ˈfɔːkɑːstɪŋ] n. prediction
collectively [kəˈlektɪvli] adv. as a whole, jointly
sheer [ʃɪə] adj. utter, pure, absolute
combine [kəmˈbaɪn] vt. to join together, unite
contrast [kənˈtrɑːst] vt. to compare so as to show differences; vi. to form a contrast; [ˈkɒntrɑːst] n. contrast, comparison
silo [ˈsaɪləʊ] n. silo, storage tower
datamation [ˌdeɪtəˈmeɪʃn] n. automatic data processing
researcher [rɪˈsɜːtʃə] n. a person who carries out research
predict [prɪˈdɪkt] v. to foretell, forecast
terrorist [ˈterərɪst] n. terrorist
gene [dʒiːn] n. gene, hereditary factor
disease [dɪˈziːz] n. illness
anonymous [əˈnɒnɪməs] adj. with the name withheld
emergency [ɪˈmɜːdʒənsi] n. urgent situation, sudden event, critical moment
containment [kənˈteɪnmənt] n. control, restraint
boost [buːst] v. to push forward, promote
retention [rɪˈtenʃn] n. the power of keeping or holding
mainstream [ˈmeɪnstriːm] n. the main current
maturity [məˈtʃʊərəti] n. ripeness, completeness
trajectory [trəˈdʒektəri] n. path, course
mitigate [ˈmɪtɪɡeɪt] v. to lessen, reduce
self-service [ˌself ˈsɜːvɪs] n. self-service
enrich [ɪnˈrɪtʃ] vt. to make richer; to concentrate
low-latency [ˌləʊ ˈleɪtənsi] n. short response time, low delay
orchestration [ˌɔːkɪˈstreɪʃn] n. arrangement for an orchestra; by extension, coordination of components
burden [ˈbɜːdn] n. load, burden; v. to load, weigh down
messy [ˈmesi] adj. untidy, disordered
usefulness [ˈjuːsflnəs] n. utility, effectiveness


Phrases

high performance: high capability, high precision
text mining: extracting information from text
in breaking down: in the course of breaking down
data silo: an isolated store of data, a data island
subscribe for: to sign up for, order in advance
text message: a short message sent between phones
cell phone: mobile phone
disease containment: disease control
originate from: to have its source in
customer service: support that a company gives to its customers
competitive advantage: an edge over competitors
life cycle: the series of stages from beginning to end
business performance: operating results
multiple source: several sources, a compound source
distributed computer system: a computer system whose parts run on several machines
parallel operation: operations carried out at the same time


Abbreviations

DNA (Deoxyribonucleic Acid)
SSD (Solid State Drive)


Exercises


【Ex. 5】 Answer the following questions according to the text.

(1) What is big data analytics? 

(2) What can big data analytics do? 

(3) How is big data analytics typically performed to analyze such a large volume of data? 
(4) What is the first challenge of big data analytics? 

(5) What is a second big data challenge? 

(6) What do today’s advances in analyzing big data allow researchers to do according 
to Datamation? 

(7) What do many big data projects originate from? 

(8) What can an enterprise do with the right big data analytics platforms in place? 

(9) What does the business area getting the most attention relate to? 

(10) What does the last part of the passage mainly talk about?


参考 译文 


大 数据 
大 数据 正在 改变 组 织 内 部 人 们 协同 工作 的 方式 。 它 正在 创造 一 种 文化 ， 使 得 业务 和 
IT 领导 者 必须 联合 起 来 , 以便 实 现 所 有 数据 的 价值 。 大 数据 让 所 有 员工 能 够 更 好 地 做 出 
决策 一 一 包括 深化 客户 参与 度 、 优化 运营 、 防止 威胁 和 欺诈 行为 以 及 开辟 新 的 收入 来 源 。 


RV 
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这 的 确 是 大 数据 的 梦想 ， 也 是 我 们 在 寻找 的 目标 。 从 大 大 小 小 的 数据 中 获得 价值 以 
证 明 投资 所 值 ， 无 论 是 大 数据 分 析 或 传统 分 析 、 数 据 仓库 或 商业 智能 工具 ,或 许 只 是 不 
同 的 名 称 而 已 。 根 据 谷歌 搜索 过 去 两 年 寻找 类 似 条 目的 数量 ,似乎 表明 人 们 对 大 数据 的 
价值 越 来 越 感 兴趣 。 
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12 数据 量 


毫 无 疑问 ， 信 息 爆炸 已 经 重新 定义 了 数据 量 的 含义 。 有 几 个 如 此 惊人 的 统计 数字 ， 
要 跟踪 数据 越 来 越 难 了 ， 要 度量 这 样 的 数据 需要 给 “ 字 节 ”前 面 加 上 种 种 前 级 。 因 为 有 
“ 巨 量 的 数据 ”新 创造 出 的 术语 “Hellabyte” 已 经 超越 PB. EB. ZB 和 YB。 然而， 
这 些 度量 单位 将 被 Brontobytes、Geopbyte 等 替代 ， 让 我 们 继续 吧 ! 


13 高速 性 


同样 地 ， 高 速 性 是 指 产生 数据 的 速度 。 社 交 媒 体 的 扩散 和 IoT ( 物 联网 ) 的 爆炸 式 
增长 是 加 剧 这 一 趋势 的 一 些 因素 。 在 尚未 被 社交 媒体 或 物 联网 影响 的 业务 运营 中 ， 时 效 
性 来 自 复杂 的 企业 应 用 ， 它 捕捉 了 每 一 个 特定 业务 流程 的 每 一 个 微小 的 细节 。 传 统 上 企 
业 应 用 也 捕获 这 些 信息 ， 但 在 大 数据 时 代 ， 这 些 信息 就 是 力量 。 


14 多 样 性 


大 数据 的 最 后 一 个 原始 属性 是 多 样 性 。 既 然 我 们 生活 在 一 个 日 益 数字 化 的 世界 里 ， 
技术 已 经 侵入 我 们 的 眼镜 和 手表 ， 多 样 性 所 产生 的 数据 是 令 人 难以 置信 的 。 可 用 的 计算 
能 力 能 够 处 理 非 结 构 化 的 文本 、 图 像 、 音 频 、 视 频 以 及 来 自 物 联网 传感器 的 数据 ， 这 几 
乎 可 以 捕获 我 们 周围 的 一 切 。 今 天 ， 大 数据 的 这 个 属性 与 我 们 现在 的 生活 的 联系 比 以 往 
更 紧密 。 


1.5 真实 性 或 有 效 性 


数据 的 真实 性 或 有 效 性 对 提取 基础 数据 的 价值 非常 重要 。 真 实 性 意味 着 数据 是 可 验 
证 的 和 真实 的 。 如 果 违反 这 个 条 件 , 其 结果 可 能 是 灾难 性 的 。 更 重要 的 是 , 有 几 种 情况 ， 
其 中 数据 虽然 准确 但 在 特定 情况 下 无 效 。 例 如 ， 如 果 试 图 确定 谷歌 中 “大 数据 ”的 搜索 
量 ， 我 们 也 会 获得 有 关 “ 大 数据 ”的 “危险 ”的 结果 。 


16 可 见 性 


信息 孤岛 一 直 在 企业 中 存在 ， 并 且 一 直 是 从 数据 中 提取 价值 的 主要 障碍 之 一 。 不 仅 
应 该 有 相关 信息 ， 而 且 应 该 在 合适 的 时 间 给 合适 的 人 看 到 。 可 操作 的 数据 需要 超越 职能 
部 门 甚至 组 织 的 界限 ， 并 被 其 所 见 ， 才 能 释放 数据 的 价值 。 个 体 可 能 会 认为 在 他 们 手中 
的 信息 就 是 力量 ， 但 在 大 数据 时 代 ， 大 量 的 对 全 球 有 效 的 整合 信息 才 真正 无 所 不 能 ! 


1.7 视觉 性 


我 们 生活 在 一 个 日 益 视觉 化 的 世界 ， 统 计数 据 表 明 ， 在 互联 网 上 共享 的 图 像 和 视频 
的 数量 以 惊人 的 速度 增加 。 据 官方 统计 ， 每 分 钟 有 300 小 时 的 视频 被 上 传 到 YouTube. 
在 商业 环境 中 ， 适 当 的 可 视 化 数据 对 管理 者 是 至 关 重要 的 ， 他 们 能 够 在 有 限 的 时 间 和 资 
源 甚至 更 有 限 的 注意 力 中 获得 价值 ! 
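2. More Attributes


Besides the seven Vs above, there may be several other Vs.


2.1 Volatility


As more and more applications (such as SnapChat and IoT sensors) emerge, some input and output data may be produced on the fly. The volatility of the underlying data source may in future become one of its defining attributes.


2.2 Variance


A cornerstone of traditional statistics is standard deviation and variance. Whether or not it appears in the extended list for big data, it must never be ignored.


2.3 Viability


The viability of every project needs to be checked, and this is subsumed under the concept of value. Big data projects can grow to enormous proportions and consume large amounts of resources very quickly. Anyone who fails to learn quickly and merely chases the fashion will run out of money and fail. In short, every project needs a feasibility study, and big data projects are no exception, whether or not “big data” is still a buzzword.


2.4 Vitality


The vitality, or criticality, of data is another crucial concept that is subsumed under the concept of value. Priority should go to the information that is more meaningful or more important for achieving the underlying business objectives. Over-analysis needs to give way to a more pragmatic approach. Technology allows marketers to create a segment, but is such extreme segmentation vital to the organization? Is it aligned with the organization’s strategy?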

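2.5 Vincularity


The word Vincularity derives from Latin and means connectivity or linkage. The concept is highly relevant to today’s connected world. Connecting different sets of information can yield potential arbitrage value. For example, governments have long tried to link the details of major expenditures and correlate them with income tax returns to find out whether income has been concealed. That goal can now be achieved by extracting information from social media posts.


3. An Example of Big Data


An example of big data might be petabytes (1 PB = 1,024 terabytes) or exabytes (1 EB = 1,024 petabytes) of data containing billions of records on millions of people, drawn from different sources (such as the Web, sales, customer contact centers, social media and mobile data). Such data is usually loosely structured, and is often incomplete and hard to access.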
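
The ladder of storage units used in this example can be made concrete with a short Python sketch. It assumes binary prefixes, in which each unit is 1,024 times the previous one, as the figures in this section do. The function name `humanize_bytes` is hypothetical and chosen only for illustration.

```python
# Binary storage units, each 1,024 times the previous one.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def humanize_bytes(num_bytes: int) -> str:
    """Express a byte count in the largest unit that keeps the value below 1,024."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit that fits, or at the last unit in the list.
        if value < 1024 or unit == UNITS[-1]:
            return f"{value:.1f} {unit}"
        value /= 1024


one_petabyte = 1024 ** 5            # 1 PB = 1,024 TB
one_exabyte = 1024 * one_petabyte   # 1 EB = 1,024 PB

print(humanize_bytes(one_petabyte))
print(humanize_bytes(one_exabyte))
```

Units beyond YB, such as the Brontobyte, are left out of the sketch because they are informal coinages rather than standard prefixes.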


Text A 


Computer Software 


Computer software, consisting of programs, enables a computer to perform specific tasks. 
It is opposed to its physical components (hardware) which can only do the tasks they are 
mechanically designed for. The term includes application software such as word processors, 
which perform productive tasks for users, system software such as operating systems, which 
interface with hardware to run the necessary services for user-interfaces and applications, and 
middleware, which controls and coordinates distributed systems. 


1. Terminology 


The term “software” is an instruction-procedural programming source for scheduling 
instruction streams according to the von Neumann machine paradigm. It should not be 
confused with Configware and Flowware, which are programming sources for configuring the 
resources (structural “programming” by Configware) and for scheduling the data streams 
(data-procedural programming by Flowware) of the Anti machine paradigm of 
Reconfigurable Computing systems. 


2. Relationship to Computer Hardware 


Computer software is so called in contrast to computer hardware, which encompasses the 
physical interconnections and devices required to store and execute (or run) the software. In 
computers, software is loaded into RAM and executed in the central processing unit. At the
lowest level, software consists of a machine language specific to an individual processor. A 
machine language consists of groups of binary values signifying processor instructions (object 
codes), which change the state of the computer from its preceding state. Software is an 
ordered sequence of instructions for changing the state of the computer hardware in a 
particular sequence. It is usually written in high-level programming languages that are easier 
and more efficient for humans to use (closer to natural language) than machine language. 
High-level languages are compiled or interpreted into machine language object code. 
Software may also be written in an assembly language, essentially, a mnemonic 
representation of a machine language using a natural language alphabet. Assembly language 
must be assembled into object code via an assembler. 
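
To get a feel for how high-level source is translated into lower-level instructions, the sketch below uses Python's standard `dis` module to list the instructions a small function is compiled into. Note the assumption: these are bytecode instructions for the Python virtual machine, not machine language for a physical processor, so the example is only an analogy to the compile step described above. The function `add` is a made-up example, and the exact instruction names vary between Python versions.

```python
import dis


def add(a, b):
    """A one-line function written in a high-level language."""
    return a + b


# Collect the names of the low-level instructions the function was compiled into.
opnames = [instruction.opname for instruction in dis.get_instructions(add)]
print(opnames)
```

Each name in the printed list plays the role that a mnemonic plays in assembly language: a readable label for one numeric instruction code.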

In computer science and software engineering, computer software is all computer 
programs. The concept of reading different sequences of instructions into the memory of a 
device to control computations was invented by Charles Babbage as part of his difference 
engine. 


3. Types 


Practical computer systems divide software systems into three major classes: system 
software, programming software and application software, although the distinction is arbitrary, 
and often blurred. 


3.1 System Software 


System software helps run the computer hardware and computer system. It includes 
operating systems, device drivers, diagnostic tools, servers, windowing systems, utilities and 
more. The purpose of systems software is to insulate the applications programmer as much as 
possible from the details of the particular computer complex being used, especially memory 
and other hardware features, and such accessory devices as communications, printers, readers, 
displays, keyboards, etc. 


3.2 Programming Software 


Programming software usually provides tools to assist a programmer in writing computer 
programs and software using different programming languages in a more convenient way. 
The tools include text editors, compilers, interpreters, linkers, debuggers, and so on. An 
integrated development environment (IDE) merges those tools into a software bundle, and a 
programmer may not need to type multiple commands for compiling, interpreting, debugging, 
tracing, etc., because the IDE usually has an advanced graphical user interface, or GUI.


3.3 Application Software


Application software allows end users to accomplish one or more specific (non-computer 
related) tasks. Typical applications include industrial automation, business software, 
educational software, medical software, databases, and computer games. Businesses are 
probably the biggest users of application software, but almost every field of human activity 
now uses some form of application software. It is used to automate all sorts of functions. 


4. Three Layers 


Users often see things differently than programmers. People who use modern general-purpose
computers (as opposed to embedded systems, analog computers, supercomputers, etc.)
usually see three layers of software performing a variety of tasks: platform, application, and 
user software. 


4.1 Platform Software 


Platform software includes the firmware, device drivers, an operating system, and typically a
graphical user interface which, in total, allows a user to interact with the computer and its 
peripherals (associated equipment). Platform software often comes bundled with the computer, 
and users may not realize that it exists or that they have a choice to use different platform 
software. 


4.2 Application Software 


Application software or Applications are what most people think of when they think of 
software. Typical examples include office suites and video games. Application software is 
often purchased separately from computer hardware. Sometimes applications are bundled 
with the computer, but that does not change the fact that they run as independent applications. 
Applications are almost always independent programs from the operating system, though they 
are often tailored for specific platforms. Most users think of compilers, databases, and other 
"system software" as applications. 


4.3 User Software 


User software tailors systems to meet the user’s specific needs. User software includes
spreadsheet templates, word processor macros, scientific simulations, and scripts for graphics 
and animations. Even email filters are a kind of user software. Users create this software 
themselves and often overlook how important it is. Depending on how competently the 
user-written software has been integrated into purchased application packages, many users 
may not be aware of the distinction between the purchased packages, and what has been 
added by fellow co-workers.


5. Operation 


Computer software has to be “loaded” into the computer’s storage (such as a hard drive, 
memory, or RAM). Once the software is loaded, the computer is able to execute the software. 
Computers operate by executing the computer program. This involves passing instructions 
from the application software, through the system software, to the hardware which ultimately 
receives the instruction as machine code. Each instruction causes the computer to carry out an 
operation — moving data, carrying out a computation, or altering the control flow of 
instructions. 

Data movement is typically from one place in memory to another. Sometimes it involves 
moving data between memory and registers which enable high-speed data access in the CPU. 
Moving data, especially large amounts of it, can be costly. So, this is sometimes avoided by 
using “pointers” to data instead. Computations include simple operations such as incrementing 
the value of a variable data element. More complex computations may involve many 
operations and data elements together. 
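
The cost argument for pointers can be illustrated in Python, where a variable holds a reference to an object rather than the object itself. This is only an analogy: Python references are not raw memory addresses, and the variable names below are illustrative.

```python
records = list(range(1_000_000))   # a large block of data

alias = records            # no data is moved: both names refer to the same list
snapshot = list(records)   # a full copy: a new list with a million entries is built

alias[0] = -1              # a change made through the alias...
print(records[0])          # ...is visible through the original name
print(snapshot[0])         # the copy is unaffected
```

Binding `alias` takes the same time however large the list is, whereas building `snapshot` takes time proportional to its length. That difference is the saving the text attributes to pointers.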

Instructions may be performed sequentially, conditionally, or iteratively. Sequential 
instructions are those operations that are performed one after another. Conditional instructions 
are performed such that different sets of instructions execute depending on the value(s) of 
some data. In some languages this is known as an “if” statement. Iterative instructions are 
performed repetitively and may depend on some data value. This is sometimes called a “loop.”
Often, one instruction may “call” another set of instructions that are defined in some other 
program or module. When more than one computer processor is used, instructions may be 
executed simultaneously. 
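
The kinds of control flow described in this paragraph can be seen together in a few lines of Python. The function names are invented for illustration.

```python
def square(n):
    """A separate set of instructions that other code can call."""
    return n * n


def sum_of_even_squares(values):
    total = 0                       # sequential: statements run one after another
    for value in values:            # iterative: the loop body repeats for each element
        if value % 2 == 0:          # conditional: the body runs only when the test is true
            total += square(value)  # call: control passes to square() and then returns
    return total


print(sum_of_even_squares([1, 2, 3, 4]))
```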

A simple example of the way software operates is what happens when a user selects an 
entry such as “Copy” from a menu. In this case, a conditional instruction is executed to copy 
text from data in a “document” area residing in memory, perhaps to an intermediate storage
area known as a “clipboard” data area. If a different menu entry such as “Paste” is chosen,
the software may execute the instructions to copy the text from the clipboard data area to a 
specific location in the same or another document in memory. 
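
The Copy and Paste example can be sketched as a toy model in Python. Everything here is an assumption made for illustration: the `Editor` class, its `handle_menu` method and the use of plain strings for the document and clipboard areas do not describe how any real editor is built.

```python
class Editor:
    """A toy text editor with a document area and a clipboard area."""

    def __init__(self, text):
        self.document = text   # the "document" data area in memory
        self.clipboard = ""    # the intermediate "clipboard" data area

    def handle_menu(self, entry, start=0, end=0):
        # A conditional instruction selects the code to run for each menu entry.
        if entry == "Copy":
            self.clipboard = self.document[start:end]
        elif entry == "Paste":
            self.document = (
                self.document[:start] + self.clipboard + self.document[start:]
            )
        else:
            raise ValueError(f"unknown menu entry: {entry}")


editor = Editor("hello world")
editor.handle_menu("Copy", 0, 5)        # copy the first five characters
editor.handle_menu("Paste", start=11)   # paste them at the end of the document
print(editor.clipboard)
print(editor.document)
```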

Depending on the application, even the example above could become complicated. The 
field of software engineering endeavors to manage the complexity of how software operates. 
This is especially true for software that operates in the context of a large or powerful 
computer system. 

Kinds of software by operation: computer program as executable, source code or script,
configuration.


6. Quality and Reliability


Software reliability considers the errors, faults, and failures related to the creation and 
operation of software. 

Software quality is very important, especially for commercial and system software 
like Microsoft Office, Microsoft Windows and Linux. If software is faulty (buggy), it can 
delete a person’s work, crash the computer and do other unexpected things. Faults and errors 
are called “bugs” which are often discovered during alpha and beta testing. Software is often 
also a victim to what is known as software aging, the progressive performance degradation 
resulting from a combination of unseen bugs. 

Many bugs are discovered and eliminated (debugged) through software testing. However, 
software testing rarely—if ever—eliminates every bug; some programmers say that “every 
program has at least one more bug”. In the waterfall method of software development,
separate testing teams are typically employed, but in newer approaches, collectively 
termed agile software development, developers often do all their own testing, and demonstrate 
the software to users/clients regularly to obtain feedback. Software can be tested through unit 
testing, regression testing and other methods, which are done manually, or most commonly, 
automatically, since the amount of code to be tested can be quite large. For instance, 
NASA has extremely rigorous software testing procedures for many operating systems and 
communication functions. Many NASA-based operations interact and identify each other 
through command programs. This enables many people who work at NASA to check and 
evaluate functional systems overall. Programs containing command software enable hardware 
engineering and system operations to function together much more easily.
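
Automated unit testing, mentioned above, can be sketched with Python's standard `unittest` module. The function under test, `word_count`, is a hypothetical example. The second test shows the idea behind regression testing: once a case has been checked, it stays in the suite so that a later change cannot quietly break it.

```python
import unittest


def word_count(text):
    """Return the number of whitespace-separated words in text."""
    return len(text.split())


class WordCountTest(unittest.TestCase):
    def test_counts_words(self):
        self.assertEqual(word_count("big data is big"), 4)

    def test_empty_string_has_no_words(self):
        # Kept in the suite as a regression check for the empty-input case.
        self.assertEqual(word_count(""), 0)


# Run the suite programmatically and report whether every test passed.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(WordCountTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print("all tests passed:", result.wasSuccessful())
```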


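New Words
mechanically [mɪˈkænɪkəli] adv. in a mechanical way
middleware [ˈmɪdlweə] n. middleware; intermediary equipment
procedural [prəˈsiːdʒərəl] adj. procedural
paradigm [ˈpærədaɪm] n. paradigm, model
structural [ˈstrʌktʃərəl] adj. structural; structured
interconnection [ˌɪntəkəˈnekʃən] n. interconnection
compile [kəmˈpaɪl] vt. to compile
assembler [əˈsemblə] n. assembler (assembly program)
arbitrary [ˈɑːbɪtrəri] adj. arbitrary
blur [blɜː] v. to blur, make indistinct
insulate [ˈɪnsjuleɪt] vt. to insulate, isolate
reader [ˈriːdə] n. card reader
convenient [kənˈviːniənt] adj. convenient, handy
interpreter [ɪnˈtɜːprɪtə] n. interpreter (program)
linker [ˈlɪŋkə] n. linker (for object code)
debugger [diːˈbʌɡə] n. debugger
merge [mɜːdʒ] v. to merge, combine
bundle [ˈbʌndl] n. bundle, package; v. to bundle
tailored [ˈteɪləd] adj. custom-made, purpose-built
template [ˈtemplɪt] n. template (= templet)
macro [ˈmækrəʊ] n. macro
script [skrɪpt] n. script
animation [ˌænɪˈmeɪʃən] n. animation
filter [ˈfɪltə] n. filter
competently [ˈkɒmpɪtəntli] adv. competently, suitably
co-worker [kəʊˈwɜːkə] n. collaborator, colleague, helper
alter [ˈɔːltə] v. to alter, change
costly [ˈkɒstli] adj. expensive; causing loss
pointer [ˈpɔɪntə] n. pointer
conditionally [kənˈdɪʃənəli] adv. conditionally
iteratively [ˈɪtərətɪvli] adv. repeatedly; iteratively
call [kɔːl] n. & v. call, invoke
clipboard [ˈklɪpbɔːd] n. clipboard
endeavor [ɪnˈdevə] n. & vi. effort; to strive
reliability [rɪˌlaɪəˈbɪlɪti] n. reliability


Phrases


be opposed to    to stand in contrast to
system software    software that runs and manages the computer itself
distributed system    a computer system spread across several machines
be confused with    to be mistaken for
in contrast to    as distinct from; compared with
machine language    the binary instruction set of a processor
object code    the code produced by a compiler or assembler
ordered sequence    a series arranged in a fixed order
high-level programming languages    programming languages closer to natural language than to machine language
natural language    a language spoken by people
assembly language    a mnemonic form of machine language
software engineering    the discipline of building software systematically
difference engine    Babbage's mechanical calculating machine
divide ... into ...    to split ... into ...
device driver    a program that controls a hardware device
diagnostic tool    a tool for finding faults
as much as possible    to the greatest extent possible
computer complex    a computing installation
text editor    a program for editing plain text
integrated development environment (IDE)    a bundle of programming tools behind one interface
computer game    a game program run on a computer
all sorts of    many kinds of
embedded system    a computer built into a larger device
analog computer    a computer that works with continuous quantities
a variety of    various
platform software    the software layer that applications run on
in total    as a whole
come with    to be supplied together with
video game    a computer or television game
separate from    apart from
be integrated into ...    to be built into ...
be aware of    to know about
be able to    can
carry out    to perform
one after another    in succession
be known as    to be called
conditional instruction    an instruction executed only when a condition holds
be incapable of    to be unable to
source code    program text as written by the programmer
software reliability    the dependability of software


Notes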


[1] It is opposed to its physical components (hardware) which can only do the tasks they are mechanically designed for.
In this sentence, which can only do the tasks they are mechanically designed for is an attributive clause modifying its physical components. Within that clause, they are mechanically designed for is itself an attributive clause modifying the tasks. (hardware) is a parenthetical gloss on its physical components.

[2] The term includes application software such as word processors, which perform productive tasks for users, system software such as operating systems, which interface with hardware to run the necessary services for user-interfaces and applications, and middleware, which controls and coordinates distributed systems.
In this sentence, which perform productive tasks for users is a non-restrictive attributive clause modifying word processors; which interface with hardware to run the necessary services for user-interfaces and applications is a non-restrictive attributive clause modifying operating systems; and which controls and coordinates distributed systems is likewise a non-restrictive attributive clause, modifying middleware. such as means “for example” and introduces examples.

[3] It should not be confused with Configware and Flowware, which are programming sources for configuring the resources (structural “programming” by Configware) and for scheduling the data streams (data-procedural programming by Flowware) of the Anti machine paradigm of Reconfigurable Computing systems.
In this sentence, which are programming sources for configuring the resources (structural “programming” by Configware) and for scheduling the data streams (data-procedural programming by Flowware) of the Anti machine paradigm of Reconfigurable Computing systems is a non-restrictive attributive clause giving additional information about Configware and Flowware.

[4] Computer software is so called in contrast to computer hardware, which encompasses the physical interconnections and devices required to store and execute (or run) the software.
In this sentence, which encompasses the physical interconnections and devices required to store and execute (or run) the software is a non-restrictive attributive clause giving additional information about computer hardware. required to store and execute (or run) the software is a past participle phrase used attributively to modify the physical interconnections and devices. in contrast to means “as distinct from” or “compared with”.

[5] A machine language consists of groups of binary values signifying processor instructions (object codes), which change the state of the computer from its preceding state.
In this sentence, signifying processor instructions (object codes) is a present participle phrase used attributively to modify binary values. which change the state of the computer from its preceding state is a non-restrictive attributive clause giving additional information about processor instructions.


Exercises


【Ex. 1】 Answer the following questions according to the text.
(1) What does computer software consist of? What does it do?

(2) What does the term computer software include?

(3) What does a machine language consist of? 

(4) What are the three major classes practical computer systems divide software systems into? 

(5) What does system software do? What does it include? 

(6) What does programming software usually do? 

(7) What does application software usually do? What do typical applications include? 

(8) What are the three layers of software? 

(9) What are the sequential instructions, conditional instructions and iterative instructions 
respectively? 

(10) What can happen if software is faulty (buggy)? 


【Ex. 2】 Translate the following terms between English and Chinese.
1. device driver
2. middleware
3. diagnostic tool
4. integrated development environment
5. pointer
6. 配置，设定
7. 嵌入式系统
8. 模块
9. 汇编程序
10. 编译器


【Ex. 3】 Fill each blank with the appropriate word from the list below (use each word only once).


application    computer    create    enable    transferring
programs    software    user    basics    processing


System software is closely related to, but distinct from, Operating System software. It is
any computer software that provides the infrastructure over which ___(1)___ can operate, i.e. it
manages and controls computer hardware so that ___(2)___ software can perform. Operating
systems, such as GNU, Microsoft Windows, Mac OS X or Linux, are prominent examples of
system ___(3)___.

System software is software that basically allows the parts of a ___(4)___ to work
together. Without the system software the computer cannot operate as a single unit. In
contrast to system software, software that allows you to do things like ___(5)___ text
documents, play games, listen to music, or surf the web is called application software.

In general, application programs are software that ___(6)___ the end-user to perform
specific, productive tasks, such as word ___(7)___ or image manipulation. System software
performs tasks like ___(8)___ data from memory to disk, or rendering text onto a display
device.

System software is not generally what a user would buy a computer for; instead, it is
usually the ___(9)___ of a computer which come built-in. Application software is the programs
on the computer when the ___(10)___ buys it. These may include word processors and web
browsers.


【Ex. 4】 Translate the following passage into Chinese.

Computer programs (also software programs, or just programs) are instructions for a 
computer. A computer requires programs to function, typically executing the program’s 
instructions in a central processor. The program has an executable form that the computer can 
use directly to execute the instructions. 

Computer source code is often written by professional computer programmers. Source 
code may be converted into an executable file (sometimes called an executable program or a 
binary) by a compiler. Alternatively, computer programs may be executed by a central 
processing unit with the aid of an interpreter, or may be embedded directly into hardware. 

Computer programs may be categorized along functional lines: system software and 
application software. And many computer programs may run simultaneously on a single 
computer, a process known as multitasking. 


Text B 


Software Development Process 


Software development process is a structure imposed on the development of a software 
product. Synonyms include software life cycle and software process. There are several models 
for such processes, each describing approaches to a variety of tasks or activities that take 
place during the process. 


1. Processes and meta-processes 


A growing body of software development organizations implement process methodologies.
Many of them are in the defense industry, which in the U.S. requires a rating based on
“process models” to obtain contracts. The international standard for describing the method
of selecting, implementing and monitoring the life cycle for software is ISO 12207.

The Capability Maturity Model (CMM) is one of the leading models. Independent
assessments grade organizations on how well they follow their defined processes, not on the
quality of those processes or the software produced. CMM is gradually being replaced by CMMI.
ISO 9000 describes standards for formally organizing processes with documentation. 

ISO 15504, also known as Software Process Improvement Capability Determination 
(SPICE), is a “framework for the assessment of software processes”. This standard is aimed
at setting out a clear model for process comparison. SPICE is used much like CMM and 
CMMI. It models processes to manage, control, guide and monitor software development. 
This model is then used to measure what a development organization or project team actually 
does during software development. This information is analyzed to identify weaknesses and 
drive improvement. It also identifies strengths that can be continued or integrated into 
common practice for that organization or team. 

Six Sigma is a methodology to manage process variations that uses data and statistical 
analysis to measure and improve a company’s operational performance. It works by 
identifying and eliminating defects in manufacturing and service-related processes. The 
maximum permissible number of defects is 3.4 per one million opportunities. However, Six Sigma is
manufacturing-oriented and needs further research on its relevance to software development. 
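
The 3.4-per-million threshold is usually checked with the defects-per-million-opportunities (DPMO) measure. The sketch below assumes the common definition, DPMO = defects × 1,000,000 / (units × opportunities per unit). The function name and the sample figures are invented for illustration and do not come from the text.

```python
SIX_SIGMA_DPMO = 3.4  # maximum permissible defects per million opportunities


def dpmo(defects, units, opportunities_per_unit):
    """Return defects per million opportunities."""
    total_opportunities = units * opportunities_per_unit
    if total_opportunities <= 0:
        raise ValueError("there must be at least one opportunity")
    return defects * 1_000_000 / total_opportunities


# Hypothetical figures: 17 defects in 500,000 units with 10 opportunities each.
rate = dpmo(defects=17, units=500_000, opportunities_per_unit=10)
print(rate, rate <= SIX_SIGMA_DPMO)
```

For software, the difficulty the text points to is deciding what counts as an "opportunity": it could be a line of code, a function point or a test case.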


1.1 Domain Analysis


Often the first step in attempting to design a new piece of software, whether it be an
addition to an existing software, a new application, a new subsystem or a whole new system,
is what is generally referred to as “Domain Analysis”. Assuming that the developers
(including the analysts) are not sufficiently knowledgeable in the subject area of the new 
software, the first task is to investigate the so-called “domain” of the software. The more 
knowledgeable they are about the domain already, the less the work required. Another 
objective of this work is to make the analysts who will later try to elicit and gather the 
requirements from the area experts or professionals, speak with them in the domain’s own 
terminology and to better understand what is being said by these people. Otherwise they will 
not be taken seriously. So, this phase is an important prelude to extracting and gathering the 
requirements. The following quote captures the kind of situation an analyst who hasn’t done 
his homework well may face in speaking with a professional from the domain: “I know you 
believe you understood what you think I said, but I am not sure you realize what you heard is 
not what I meant.” 


1.2 Software Elements Analysis 


The most important task in creating a software product is extracting the requirements. 
Customers typically know what they want, but not what software should do, while incomplete, 
ambiguous or contradictory requirements are recognized by skilled and experienced software
engineers. Frequently demonstrating live code may help reduce the risk that the requirements 
are incorrect. 


1.3 Specification 


Specification is the task of precisely describing the software to be written, possibly in a 
rigorous way. In practice, most successful specifications are written to understand and 
fine-tune applications that were already well-developed, although safety-critical software 
systems are often carefully specified prior to application development. Specifications are most 
important for external interfaces that must remain stable. 


1.4 Software architecture 


The architecture of a software system refers to an abstract representation of that system. 
Architecture is concerned with making sure the software system will meet the requirements of 
the product, as well as ensuring that future requirements can be addressed. The architecture 
step also addresses interfaces between the software system and other software products, as 
well as the underlying hardware or the host operating system. 


1.5 Implementation (or coding) 


Reducing a design to code may be the most obvious part of the software engineering job, 
but it is not necessarily the largest portion. 


1.6 Testing 


Testing of parts of software, especially where code by two different engineers must work 
together, falls to the software engineer. 


1.7 Documentation 


An important (and often overlooked) task is documenting the internal design of software 
for the purpose of future maintenance and enhancement. Documentation is most important for 
external interfaces. 


2. Software Training and Support 


A large percentage of software projects fail because the developers fail to realize that it 
doesn't matter how much time and planning a development team puts into creating software if 
nobody in an organization ends up using it. People are occasionally resistant to change and
avoid venturing into an unfamiliar area, so, as a part of the deployment phase, it is very
important to hold training classes for the most enthusiastic software users first (building
excitement and confidence), then shift the training towards the neutral users intermixed with
the avid supporters, and finally incorporate the rest of the organization into adopting the new
software. Users will have lots of questions and software problems, which leads to the next
phase of software.


3. Maintenance 


Maintaining and enhancing software to cope with newly discovered problems or new 
requirements can take far more time than the initial development of the software. Not only 
may it be necessary to add code that does not fit the original design but just determining how 
software works at some point after it is completed may require significant effort by a software 
engineer. About 2/3 of all software engineering work is maintenance, but this statistic can be 
misleading. A small part of that is fixing bugs. Most maintenance is extending systems to do 
new things, which in many ways can be considered new work. In comparison, about 2/3 of all 
civil engineering, architecture, and construction work is maintenance in a similar way. 


New Words
process [ˈprəʊses] n. process; procedure, method, step; vt. to process, handle
activity [ækˈtɪvɪti] n. activity; action, behaviour
defense [dɪˈfens] n. defense; protection
contract [ˈkɒntrækt] n. contract
assessment [əˈsesmənt] n. assessment, evaluation
grade [ɡreɪd] n. grade, rank; vt. to grade, rate
gradually [ˈɡrædʒuəli] adv. gradually
formally [ˈfɔːməli] adv. formally; in form
team [tiːm] n. team, group
actually [ˈæktʃuəli] adv. actually, in fact
weakness [ˈwiːknɪs] n. weakness, shortcoming
methodology [ˌmeθəˈdɒlədʒi] n. methodology
permissible [pəˈmɪsɪbl] adj. permissible, allowed
defect [ˈdiːfekt] n. defect, flaw
opportunity [ˌɒpəˈtjuːnɪti] n. opportunity, chance
subsystem [ˈsʌbˌsɪstɪm] n. subsystem
sufficiently [səˈfɪʃəntli] adv. sufficiently, adequately
knowledgeable [ˈnɒlɪdʒəbl] adj. knowledgeable; well-informed
investigate [ɪnˈvestɪɡeɪt] v. to investigate, study
elicit [ɪˈlɪsɪt] vt. to elicit, draw out; to evoke
gather [ˈɡæðə] n. & v. gathering; to gather, collect
phase [feɪz] n. phase, stage; state
prelude [ˈpreljuːd] n. prelude, introduction
ambiguous [æmˈbɪɡjuəs] adj. ambiguous, unclear
contradictory [ˌkɒntrəˈdɪktəri] adj. contradictory; opposing
skilled [skɪld] adj. skilled
rigorous [ˈrɪɡərəs] adj. rigorous; precise, meticulous
fine-tune [ˌfaɪnˈtjuːn] v. to fine-tune, adjust
obvious [ˈɒbviəs] adj. obvious, evident
documentation [ˌdɒkjumenˈteɪʃən] n. documentation
overlook [ˌəʊvəˈlʊk] vt. to overlook, fail to notice
enhancement [ɪnˈhɑːnsmənt] n. enhancement, improvement
occasionally [əˈkeɪʒənəli] adv. occasionally, sometimes
resistant [rɪˈzɪstənt] adj. resistant, opposing
venture [ˈventʃə] n. venture; speculation; risk; v. to venture, risk
unfamiliar [ˌʌnfəˈmɪliə] adj. unfamiliar; inexperienced
deployment [dɪˈplɔɪmənt] n. deployment
enthusiastic [ɪnˌθjuːziˈæstɪk] adj. enthusiastic, keen
excitement [ɪkˈsaɪtmənt] n. excitement
confidence [ˈkɒnfɪdəns] n. confidence
avid [ˈævɪd] adj. eager, keen
incorporate [ɪnˈkɔːpəreɪt] vt. to incorporate; to merge
misleading [mɪsˈliːdɪŋ] adj. misleading
bug [bʌɡ] n. bug, fault


Phrases


impose on    to place (a structure or rule) on; to exert influence on
software life cycle    the life span of a software product
Capability Maturity Model (CMM)    a model for rating the maturity of software processes
Software Process Improvement Capability Determination (SPICE)    a framework for assessing software processes
set out    to state; to present
Six Sigma    a data-driven quality methodology
Domain Analysis    the study of the subject area of the software
be concerned with    to deal with
for the purpose of    in order to
put into    to invest in
end up    to finish by
cope with    to deal with


Abbreviations


CMMI (Capability Maturity Model Integration)


Exercises


【Ex. 5】 Fill in the blanks with information from the text.

1. Software development process is a structure imposed on ________.
2. ISO 15504, also known as ________, is a “framework for the assessment of software processes”. This standard is aimed at ________.
3. Six Sigma is a methodology ________ that uses data and statistical analysis ________.
4. Often the first step in attempting to design a new piece of software is ________.
5. The most important task in creating a software product is ________.
6. Specification is the task of ________.
7. Architecture is concerned with ________, as well as ensuring ________.
8. Testing of parts of software, especially where code by two different engineers must work together, falls to ________.
9. A large percentage of software projects fail because the developers fail to realize ________.
10. Most maintenance is extending systems to do new things, which in many ways can be considered ________.
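Computer Software


Computer software consists of programs and enables a computer to perform specific tasks. It stands in contrast to the physical components (hardware), which can only carry out the tasks they were mechanically designed for. The term covers application software (such as word processors, which make users more productive), system software (such as operating systems, which interface with the hardware to provide the services that user interfaces and applications need) and middleware (which controls and coordinates distributed systems).


1. Terminology


The term “software” denotes an instruction-procedural programming source that schedules instruction streams according to the von Neumann machine paradigm. It should not be confused with Configware and Flowware, which are programming sources for configuring resources (structural “programming” by Configware) and for scheduling data streams (data-procedural programming by Flowware) under the anti machine paradigm of reconfigurable computing systems.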


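2. Relationship to Computer Hardware


Computer software is so named in contrast to computer hardware, which comprises the physical interconnections and devices needed to store and execute (or run) the software. In a computer, software is loaded into RAM and executed in the central processing unit. At the lowest level, software consists of the machine language of a particular processor. A machine language consists of groups of binary values representing processor instructions (object code), which change the state of the computer. Software is an ordered sequence of instructions that changes the state of the computer hardware in a particular sequence. It is usually written in high-level languages, which are easier and more efficient for people to use than machine language because they are closer to natural language. High-level languages are compiled or interpreted into machine-language object code. Software may also be written in assembly language, which is essentially a mnemonic representation of machine language using the letters of a natural-language alphabet. Assembly language must be translated into object code by an assembler.

In computer science and software engineering, all computer programs are computer software. The concept of reading different sequences of instructions into a device’s memory to control computation was proposed by Charles Babbage as part of his difference engine.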


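3. Types


Practical computer systems divide software into three major classes: system software, programming software and application software, although the distinction is arbitrary and often blurred.


3.1 System Software


System software helps run the computer hardware and the computer system. It includes operating systems, device drivers, diagnostic tools, servers, windowing systems, utilities and more. Its purpose is to insulate the application programmer as far as possible from the details of the particular computer complex in use, especially memory and other hardware features, and accessory devices such as communications equipment, printers, readers, displays and keyboards.


3.2 Programming Software


Programming software usually provides tools that help a programmer write computer programs and software in different programming languages more conveniently. The tools include text editors, compilers, interpreters, linkers and debuggers. An integrated development environment (IDE) merges these tools into a single software bundle, so the programmer need not type separate commands for compiling, interpreting, debugging, tracing and so on, because the IDE usually has an advanced graphical user interface (GUI).


3.3 Application Software


Application software allows end users to accomplish one or more specific (non-computer-related) tasks. Typical applications include industrial automation, business software, educational software, medical software, databases and computer games. Businesses are probably the biggest users of application software, but almost every field of human activity now uses some form of it. It is used to automate all sorts of functions.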


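4. Three layers


Users often see things differently than programmers. People who use modern general-purpose
computers (as opposed to embedded systems, analog computers, supercomputers, etc.) usually see
three layers of software performing a variety of tasks: platform, application, and user
software.


4.1 Platform software


Platform software includes the firmware, device drivers, an operating system, and typically a
graphical user interface which, in total, allow a user to interact with the computer and its
peripherals (associated equipment). Platform software often comes bundled with the computer,
and users may not realize that it exists or that they have a choice to use different platform
software.


4.2 Application software


Application software, or simply applications, is what most people think of when they think of
software. Typical examples include office suites and video games. Application software is
often purchased separately from computer hardware. Sometimes applications are bundled with the
computer, but that does not change the fact that they run as independent applications.
Applications are almost always programs independent of the operating system, though they are
often tailored for specific platforms. Most users think of compilers, databases, and other
"system software" as applications.


4.3 User software


User software tailors systems to meet the user's specific needs. It includes spreadsheet
templates, word-processor macros, scientific simulations, and scripts for graphics and
animations. Even email filters are a kind of user software. Users create this software
themselves and often overlook how important it is. Depending on how competently the
user-written software has been integrated into purchased application packages, many users may
not be aware of the distinction between the purchased packages and what has been added by
co-workers.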
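5. Operation

Computer software has to be "loaded" into the computer's storage (such as a hard drive, memory,
or RAM). Once the software is loaded, the computer is able to execute it. Computers operate by
executing programs. This involves passing instructions from the application software, through
the system software, to the hardware, which ultimately receives the instruction as machine
code. Each instruction causes the computer to carry out an operation: moving data, carrying out
a computation, or altering the control flow of instructions.

Data movement is typically from one place in memory to another. Sometimes it involves moving
data between memory and registers, which enable high-speed data access in the CPU. Moving data,
especially large amounts of it, can be costly, so it is sometimes avoided by using "pointers"
to the data instead. Computations include simple operations such as incrementing the value of
a variable data element. More complex computations may involve many operations and data
elements together.

Instructions may be performed sequentially, conditionally, or iteratively. Sequential
instructions are operations performed one after another. Conditional instructions are
performed such that different sets of instructions execute depending on the value(s) of some
data. In some languages this is known as an "if" statement. Iterative instructions are
performed repetitively and may depend on some data value. This is sometimes called a "loop".
Often, one instruction may "call" another set of instructions that is defined in some other
program or module. When more than one processor is used, instructions may be executed
simultaneously.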
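
As a rough illustration of the three ways instructions can be executed, the following Python
sketch runs statements in sequence, branches with an "if" statement, repeats with a loop, and
calls a set of instructions defined elsewhere. The function name `summarize` and the sample
data are invented for this illustration only.

```python
def summarize(values):
    """Return the total of the positive numbers and how many values were skipped."""
    total = 0            # sequential: executed first
    skipped = 0          # sequential: executed next
    for v in values:     # iterative: the body repeats once per element (a "loop")
        if v > 0:        # conditional: which branch runs depends on the data
            total += v   # a simple computation: incrementing a variable
        else:
            skipped += 1
    return total, skipped


# "Calling" a set of instructions defined elsewhere (here, the function above).
result = summarize([3, -1, 4, 0, 5])
print(result)  # (12, 2)
```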

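A simple example of the way software operates is what happens when a user selects an entry
such as "Copy" from a menu. In this case, a conditional instruction is executed to copy text
from data in a document area residing in memory to an intermediate storage area known as a
"clipboard". If a different menu entry such as "Paste" is chosen, the software may execute the
instructions to copy the text from the clipboard data area to a specific location in the same
or another document in memory.

Depending on the application, even the example above could become complicated. The field of
software engineering endeavors to manage the complexity of how software operates. This is
especially true for software that operates in the context of a large or powerful computer
system.

Kinds of software by operation: executable computer programs, source code or scripts, and
configuration.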


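6. Software quality and reliability


Software reliability considers the errors, faults, and failures related to the creation and
operation of software.

Software quality is very important, especially for commercial and system software such as
Microsoft Office, Microsoft Windows, and Linux. If software is faulty (buggy), it can delete a
person's work, crash the computer, and do other unexpected things. Faults and errors are called
"bugs", and they are often discovered during alpha and beta testing. Software is often also a
victim of what is known as software aging, the progressive performance degradation resulting
from a combination of unseen bugs.

Many bugs are discovered and eliminated (debugged) through software testing. However, software
testing rarely, if ever, eliminates every bug; some programmers say that "every program has at
least one more bug". In the waterfall method of software development, separate testing teams
are typically employed, but in newer approaches, collectively termed agile software
development, developers often do all their own testing and demonstrate the software to
users/clients regularly to obtain feedback. Software can be tested through unit testing,
regression testing, and other methods. Testing can be done manually but, because the amount of
code to be tested can be quite large, it is most commonly automated. For instance, NASA has
extremely rigorous software testing procedures for many operating systems and communication
functions. Many NASA-based operations interact and identify each other through command
programs. This enables many people who work at NASA to check and evaluate the functional
systems as a whole. Programs containing command software enable hardware engineering and
system operations to function together much more easily.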


Text A 


Operating System 


1. What is an operating system? 


An operating system is the core software component of your computer. It performs many 
functions and is, in very basic terms, an interface between your computer and the outside 
world. In the section about hardware, a computer is described as consisting of several 
component parts including your monitor, keyboard, mouse, and other parts. The operating 
system provides an interface to these parts using what is referred to as "drivers". This is why
sometimes when you install a new printer or other piece of hardware, your system will ask 
you to install more software called a driver. 


2. What does a driver do? 


A driver is a specially written program which understands the operation of the device it 
interfaces to, such as a printer, video card, sound card or CD-ROM drive. It translates 
commands from the operating system or user into commands understood by the component 
part it interfaces with. It also translates responses from the component part back to responses 
that can be understood by the operating system, application program, or user. 
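
The two-way translation a driver performs can be sketched in a few lines of Python. This is
only a toy model: the class `PrinterDriver`, its command codes, and its response codes are
invented for illustration and do not correspond to any real device or operating system API.

```python
class PrinterDriver:
    """Hypothetical driver: translates generic OS requests into device commands."""

    # Made-up device command and response codes, for illustration only.
    _COMMANDS = {"print": "PRN", "eject": "EJT", "status": "STS"}
    _RESPONSES = {"00": "ok", "01": "out of paper", "02": "busy"}

    def to_device(self, os_command, payload=""):
        """Translate an OS-level command into the device's own command string."""
        if os_command not in self._COMMANDS:
            raise ValueError(f"unsupported command: {os_command}")
        return f"{self._COMMANDS[os_command]}:{payload}"

    def from_device(self, raw_response):
        """Translate a raw device response code into a message the OS understands."""
        return self._RESPONSES.get(raw_response, "unknown device response")


driver = PrinterDriver()
print(driver.to_device("print", "hello"))  # PRN:hello
print(driver.from_device("01"))            # out of paper
```

The operating system and applications only see the generic commands and messages; the
device-specific codes stay inside the driver.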



大 数据 专业 英语 教程 


3. Other operating system functions 


The operating system provides several other functions, including:
* System tools (programs) used to monitor computer performance, debug problems, or
maintain parts of the system.
* A set of libraries or functions which programs may use to perform specific tasks
especially relating to interfacing with computer system components.
The operating system makes these interfacing functions (see Figure 3-1) along with its
other functions operate smoothly and these functions are mostly transparent to the user.

[Figure 3-1: Operating System Interfaces. Surviving labels in the diagram: Mouse, Keyboard.]


4. Operating system concerns 


As mentioned previously, an operating system is a computer program. Operating systems 
are written by human programmers who can make mistakes. Therefore, there can be errors in 
the code even though there may be some testing before the product is released. Some 
companies have better software quality control and testing than others, so you may notice 
varying levels of quality from operating system to operating system. Errors in operating 
systems cause three main types of problems: 

* System crashes and instabilities: These can happen due to a software bug typically
in the operating system, although computer programs being run on the operating
system can make the system more unstable or may even crash the system by
themselves. This varies depending on the type of operating system. A system crash is
the act of a system freezing and becoming unresponsive, which would cause the user
to need to reboot.

* Security flaws: Some software errors leave a door open for the system to be broken
into by unauthorized intruders. As these flaws are discovered, unauthorized intruders
may try to use them to gain illegal access to your system. Patching these flaws often
will help keep your computer system secure.

* Malfunctions: Sometimes errors in the operating system will cause the computer
not to work correctly with some peripheral devices such as printers.


5. Operating system types 


Let us look at the different types of operating systems and see how they differ from
one another.

* Real-time operating system 

It is a multitasking operating system that aims at executing real-time applications.
Real-time operating systems often use specialized scheduling algorithms so that they can
achieve deterministic behavior. The main objective of real-time operating systems is
a quick and predictable response to events. They have either an event-driven design or a
time-sharing one. An event-driven system switches between tasks based on their priorities,
while time-sharing operating systems switch tasks based on clock interrupts.
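
The difference between the two designs can be sketched with a toy simulation. The code below is
an illustrative model only, not how a real-time kernel is implemented: tasks are plain tuples,
priorities are made-up numbers (lower means more urgent), and the "clock interrupt" is simulated
by counting ticks.

```python
import heapq
from collections import deque


def event_driven_order(tasks):
    """Run tasks strictly by priority (lower number = more urgent)."""
    heap = [(priority, name) for name, priority in tasks]
    heapq.heapify(heap)
    order = []
    while heap:
        _, name = heapq.heappop(heap)   # always pick the most urgent task
        order.append(name)
    return order


def time_sharing_order(tasks, quantum=1):
    """Round-robin: each task runs for `quantum` ticks, then a simulated
    clock interrupt moves it to the back of the queue."""
    queue = deque(tasks)                # items are (name, remaining_ticks)
    trace = []
    while queue:
        name, remaining = queue.popleft()
        trace.append(name)
        remaining -= quantum
        if remaining > 0:
            queue.append((name, remaining))
    return trace


print(event_driven_order([("log", 3), ("alarm", 1), ("ui", 2)]))
# ['alarm', 'ui', 'log']
print(time_sharing_order([("A", 2), ("B", 1), ("C", 2)]))
# ['A', 'B', 'C', 'A', 'C']
```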

* Multi-user and single-user operating systems 

Multi-user computer operating systems allow multiple users to access a computer system
simultaneously. Time-sharing systems can be classified as multi-user systems as they enable
multiple users to access a computer through time sharing. Single-user operating systems, as
opposed to multi-user operating systems, are usable by only one user at a time. Being able to
have multiple accounts on a Windows operating system does not make it a multi-user system.
Rather, only the network administrator is the real user. But for a Unix-like operating system, it
is possible for two users to log in at a time, and this capability of the OS makes it a multi-user
operating system.

* Multi-tasking and single-tasking operating systems 

When a single program is allowed to run at a time, the system is grouped under the 
single-tasking system category, while in case the operating system allows for execution of 
multiple tasks at a time, it is classified as a multi-tasking operating system. Multi-tasking can 
be of two types, namely preemptive and cooperative. In preemptive multitasking, the 
operating system slices the CPU time and dedicates one slot to each of the programs. 
Unix-like operating systems such as Solaris and Linux support preemptive multitasking. If 
you are aware of the multithreading terminology, you can consider this type of multi-tasking
as similar to interleaved multithreading. Cooperative multitasking is achieved by relying on
each process to give time to the other processes in a defined manner. This kind of multitasking
is similar to the idea of block multithreading in which one thread runs until it is blocked by
some other event.
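
Cooperative multitasking can be imitated with Python generators, where `yield` plays the role
of a task voluntarily giving up the CPU. The sketch below is a teaching model under that
assumption; the names `worker` and `run_cooperatively` are invented here, and no operating
system schedules processes this way literally.

```python
from collections import deque


def worker(name, steps):
    """A task that does one unit of work, then voluntarily yields control."""
    for i in range(1, steps + 1):
        yield f"{name} step {i}"     # `yield` hands control back to the scheduler


def run_cooperatively(tasks):
    """Round-robin over tasks; a task keeps control until it yields or finishes."""
    ready = deque(tasks)
    log = []
    while ready:
        task = ready.popleft()
        try:
            log.append(next(task))   # run the task until its next yield
            ready.append(task)       # it gave up control, so requeue it
        except StopIteration:
            pass                     # the task has finished
    return log


print(run_cooperatively([worker("A", 2), worker("B", 1)]))
# ['A step 1', 'B step 1', 'A step 2']
```

A task that never yields would starve all the others, which is the main weakness of the
cooperative approach; preemptive multitasking avoids it because the operating system, not the
task, decides when to switch.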

* Distributed operating system

An operating system that manages a group of independent computers and makes them 
appear to be a single computer is known as a distributed operating system. The development 
of networked computers that could be linked and made to communicate with each other gave 
rise to distributed computing. Distributed computations are carried out on more than one 
machine. When computers in a group work in cooperation, they make a distributed system. 

* Embedded operating system 

The operating systems designed for being used in embedded computer systems are 
known as embedded operating systems. They are designed to operate on small machines like 
PDAs with less autonomy. They are able to operate with a limited number of resources. They 
are very compact and extremely efficient by design. 

* Mobile operating system 

Though not a functionally distinct kind of operating system, mobile OS is definitely an 
important mention in the list of operating system types. A mobile OS controls a mobile device 
and its design supports wireless communication and mobile applications. It has built-in 
support for mobile multimedia formats. Tablet PCs and smartphones run on mobile operating 
systems. 

* Batch processing and interactive systems 

Batch processing refers to execution of computer programs in "batches" without manual
intervention. In batch processing systems, programs are collected, grouped and processed at a
later date. The user is not prompted for inputs, as input data are collected in advance for
future processing. Input data are collected and processed in batches, hence the name batch
processing. IBM's z/OS has batch processing capabilities. By contrast, interactive
operation requires user intervention. The process cannot be executed in the user's absence.

* Online and offline processing systems

In online processing of data, the user remains in contact with the computer and processes
are executed under control of the computer's central processing unit. When processes are not
executed under direct control of the CPU, the processing is referred to as offline. Let's take
the example of batch processing. Here, the batching or grouping of data can be done without
user and CPU intervention; it can be done offline. But the actual process execution may
happen under direct control of the processor, that is, online.

New Words

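core [kɔ:] n. core, center
install [in'stɔ:l] vt. to install, to set up
translate [træns'leit] vt. to translate, to interpret, to convert
response [ris'pɔns] n. answer, response, reaction
smoothly ['smu:ðli] adv. smoothly, steadily
transparent [træns'pɛərənt] adj. transparent, obvious, clear
mention ['menʃən] n. & vt. mention, reference
mistake [mis'teik] n. mistake, error; v. to get wrong, to misunderstand
release [ri'li:s] vt. & n. release, publication
control [kən'trəul] n. & vt. control, command
instability [ˌinstə'biliti] n. unsteadiness, instability
unstable [ʌn'steibl] adj. not firm, unstable
freezing ['fri:ziŋ] adj. frozen
unresponsive [ˌʌnris'pɔnsiv] adj. not reacting, not answering
reboot [ri:'bu:t] v. to restart
security [si'kjuəriti] n. security
flaw [flɔ:] n. defect, flaw; crack
unauthorized [ʌn'ɔ:θəraizd] adj. unauthorized, not permitted, not approved
intruder [in'tru:də] n. intruder
illegal [i'li:gəl] adj. unlawful, illegal
malfunction [mæl'fʌŋkʃən] n. fault, failure, malfunction
algorithm ['ælgəriðəm] n. algorithm
deterministic [diˌtə:mi'nistik] adj. deterministic
predictable [pri'diktəbl] adj. predictable
event-driven [i'vent 'drivn] adj. event-driven
interrupt [ˌintə'rʌpt] vt. to interrupt; n. interrupt signal
preemptive [pri:'emptiv] adj. having priority, preemptive
multitasking ['mʌltiˌtɑ:skiŋ] n. multitasking
slice [slais] n. slice, piece, fragment; v. to slice
multithreading ['mʌlti'θrediŋ] n. multithreading
thread [θred] n. thread
embedded [em'bedid] adj. embedded, implanted, built-in
autonomy [ɔ:'tɔnəmi] n. autonomy
compact ['kɔmpækt] adj. compact, tight, concise
definitely ['definitli] adv. definitely, flatly
batch [bætʃ] n. batch
interactive [ˌintər'æktiv] adj. interactive
absence ['æbsəns] n. absence, lack


Phrases


video card: graphics adapter card
sound card: audio adapter card
along with...: together with
break into: to force entry into, to intrude
a member of: one of (a group)
come out: to appear
lie in: to exist in
real-time operating system: an operating system for executing real-time applications
multitasking operating system: an operating system that runs multiple tasks at a time
aim at: to target, to be directed at
scheduling algorithm: the method used to decide which task runs next
time-sharing operating system: an operating system that switches tasks on clock interrupts
clock interrupt: a periodic timer signal to the processor
multi-user operating system: an operating system usable by several users simultaneously
single-user operating system: an operating system usable by only one user at a time
time-sharing system: a system shared by several users through time sharing
be classified as ...: to be put in the category of ...
single-tasking operating system: an operating system that runs one program at a time
distributed operating system: an operating system managing a group of independent computers
embedded computer system: a computer built into a larger device
mobile operating system: an operating system that controls a mobile device
wireless communication: communication without cables
batch processing: execution of programs in batches without manual intervention
online processing system: a system in which processes run under direct control of the CPU
offline processing system: a system in which processing is not under direct control of the CPU

Abbreviations

CD (Compact Disc)
PDA (Personal Digital Assistant)
OS (Operating System)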
Notes


[1] It translates commands from the operating system or user into commands understood by
the component part it interfaces with.
In this sentence, "from the operating system or user" is a prepositional phrase used as an
attributive; it modifies the "commands" before it. "understood by the component part it
interfaces with" is a past-participle phrase used as an attributive; it modifies the
"commands" before it. Within that participle phrase, "it interfaces with" is an attributive
clause modifying "the component part".


[2] The operating system makes these interfacing functions along with its other functions
operate smoothly and these functions are mostly transparent to the user.
In this sentence, "The operating system" is the subject, "makes" is the predicate, "these
interfacing functions along with its other functions" is the object, and "operate smoothly"
is a bare infinitive phrase (an infinitive without "to") used as the object complement.
In English, when verbs such as make, let, have, see, hear, watch, notice, and feel are
followed by an infinitive as the object complement, the infinitive is used without "to".
This point is especially important. For example:
I often hear people talk about this kind of printer.
Please don't forget to have him help you with your computing.

[3] These can happen due to a software bug typically in the operating system, although
computer programs being run on the operating system can make the system more unstable
or may even crash the system by themselves.

In this sentence, "due to a software bug typically in the operating system" is an adverbial
phrase of cause; "due to" means "because of". "although computer programs being run on the
operating system can make the system more unstable or may even crash the system by
themselves" is an adverbial clause of concession. In this clause, "computer programs" is the
subject, "being run on the operating system" is an attributive modifying "computer
programs", "can make" is the predicate, "the system" is the object, and "more unstable" is
the object complement; "or" is a conjunction that joins the two coordinate predicates.

[4] When a single program is allowed to run at a time, the system is grouped under the
single-tasking system category, while in case the operating system allows for execution of
multiple tasks at a time, it is classified as a multi-tasking operating system.

In this sentence, "while" expresses contrast and means "whereas, but". It joins two complex
sentences, which explain what a single-tasking system is and what a multi-tasking operating
system is, respectively.
In English, "while" expresses different meanings in different contexts. For example:

While the discussion was still going on, George came in. ("when"; expresses time)

Multi-user computer operating systems allow multiple users to access a computer system
simultaneously while single-user operating systems are usable by only one user at a time.
("whereas, but"; expresses contrast)

While this printer is of good quality, I think it is too expensive. ("although"; expresses
concession)

We can surely overcome these difficulties while we do our best. ("as long as"; expresses
condition)


[5] An operating system that manages a group of independent computers and makes them
appear to be a single computer is known as a distributed operating system.

In this sentence, "that manages a group of independent computers and makes them appear to be
a single computer" is an attributive clause modifying "An operating system". "be known as"
means "to be called ...".


Exercises


[Ex. 1] Answer the following questions according to the text.

1. What is an operating system?

2. What does a driver do?

3. What are system tools (programs) used for?

4. Why can there be errors in the code even though there may be some testing before the
product is released?

5. What are the three main types of problems that errors in operating systems cause?

6. What is a system crash?

7. What may happen if there are security flaws? What should we do?

8. Are there many types of operating systems? What is the most common one?

9. What can time-sharing systems be classified as? What is the difference between them?

10. What are embedded operating systems? What are they designed to do?

[Ex. 2] Write the English word that matches each of the following English definitions.


1. ________: To set in place and prepare for operation.

2. ________: A signal that initiates an operation defined by an instruction.

3. ________: In programming, to convert a program from one language to another.

4. ________: An error or a fault resulting from defective judgment, deficient knowledge, or
carelessness.

5. ________: A particular version of a piece of software, most commonly associated with the
most recent version.

6. ________: Management of a computer and its processing abilities so as to maintain order
as tasks and activities are carried out.

7. ________: For a system or program, to fail to function correctly, resulting in the suspension
of operation.

8. ________: The quality or condition of being erratic or undependable.

9. ________: To turn a computer off and then on again; restart the operating system.

10. ________: A combination of input, output, and computing hardware that can be used for
work by an individual.


[Ex. 3] Translate the following sentences into Chinese.

1. It loads the operating system into memory and allows it to begin operation.

2. On the computer, there are two basic types of items that need to be organized.

3. Fonts are used by computers for on-screen display and by printers for hardcopy output.

4. Optical fiber is thin filaments of glass through which light beams are transmitted.

5. When you type things on the keyboard, the letters and numbers show up on the monitor.

6. An intranet is a private network. There are many intranets scattered all over the world.

7. On the computer screen, a folder most often looks like a yellow or blue paper file folder.

8. Once you've encoded your source content, the process of creating streaming media is
complete.

9. Syntactically, a domain name consists of a sequence of names (labels) separated by periods
(dots).

10. The quality of video you see on your monitor depends on both the video card and the
monitor you choose.


[Ex. 4] Fill in the blanks with the words below (use each word only once).


graphical through applications defined interact
boot command use loaded requests


An operating system (sometimes abbreviated as "OS") is the program that, after being
initially __(1)__ into the computer by a __(2)__ program, manages all the other programs
in a computer. The other programs are called __(3)__ or application programs. The
application programs make __(4)__ of the operating system by making __(5)__ for services
through a __(6)__ application program interface (API). In addition, users can __(7)__ directly
with the operating system __(8)__ a user interface such as a __(9)__ language or
a __(10)__ user interface (GUI).


Text B 


ETL 


In computing, Extract, Transform, Load (ETL) refers to a process in database usage and
especially in data warehousing. The ETL process became a popular concept in the
1970s. Data extraction is where data is extracted from homogeneous or heterogeneous data
sources; data transformation is where the data is transformed for storing in the proper format
or structure for the purposes of querying and analysis; data loading is where the data is
loaded into the final target database, more specifically, an operational data store, data
mart, or data warehouse.

Since data extraction takes time, it is common to execute the three phases in parallel.
While the data is being extracted, a transformation process works on the data already
received and prepares it for loading, and the data loading begins without
waiting for the completion of the previous phases.
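
One lightweight way to picture this overlap is a chain of Python generators, in which each row
flows through all three stages before the next row is read. This is a minimal sketch with
invented function names and sample data; it interleaves the phases within a single thread
rather than running them truly in parallel, which a production ETL tool would typically do with
separate processes or workers.

```python
def extract(rows):
    """Yield source rows one at a time instead of materializing them all."""
    for row in rows:
        yield row


def transform(rows):
    """Normalize each row as soon as it arrives from the extract stage."""
    for name, amount in rows:
        yield name.strip().title(), round(float(amount), 2)


def load(rows, target):
    """Append each transformed row to the target as soon as it is ready."""
    for row in rows:
        target.append(row)
    return target


source = [(" alice ", "10.5"), ("BOB", "3"), ("carol", "7.25")]
warehouse = load(transform(extract(source)), [])
print(warehouse)
# [('Alice', 10.5), ('Bob', 3.0), ('Carol', 7.25)]
```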

ETL systems commonly integrate data from multiple applications (systems), typically 
developed and supported by different vendors or hosted on separate computer hardware. The 
disparate systems containing the original data are frequently managed and operated by 
different employees. For example, a cost accounting system may combine data from payroll, 
sales, and purchasing. 


1. Extract 


The first part of an ETL process involves extracting the data from the source system(s).
In many cases, this represents the most important aspect of ETL, since extracting data
correctly sets the stage for the success of subsequent processes. Most data-warehousing
projects combine data from different source systems. Each separate system may also use a
different data organization and/or format. Common data-source formats include relational
databases, XML and flat files, but they may also include non-relational database structures
such as Information Management System (IMS) or other data structures such as Virtual
Storage Access Method (VSAM) or Indexed Sequential Access Method (ISAM), or even
formats fetched from outside sources by means such as web spidering or screen scraping.
Streaming the extracted data and loading it on-the-fly into the destination database is
another way of performing ETL when no intermediate data storage is required. In general, the
extraction phase aims to convert the data into a single format appropriate for transformation
processing.

An intrinsic part of the extraction involves data validation to confirm whether the data
pulled from the sources has the correct/expected values in a given domain (such as a
pattern/default or list of values). If the data fails the validation rules, it is rejected
entirely or in part. The rejected data is ideally reported back to the source system for
further analysis to identify and rectify the incorrect records. In some cases, the extraction
process itself may have to apply a data-validation rule in order to accept the data and pass
it to the next phase.
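
The validation step can be sketched as a function that checks each record against a list of
allowed values and a value range, and keeps the reasons for every rejection so that they can be
reported back to the source system. The field names (`region`, `age`), the rules, and the
function name `validate` are assumptions made up for this example.

```python
def validate(records, allowed_regions, max_age=120):
    """Split records into accepted and rejected according to simple domain rules."""
    accepted, rejected = [], []
    for rec in records:
        problems = []
        if rec.get("region") not in allowed_regions:          # list-of-values check
            problems.append("unknown region")
        age = rec.get("age")
        if not isinstance(age, int) or not 0 <= age <= max_age:  # range check
            problems.append("age out of range")
        if problems:
            rejected.append((rec, problems))   # keep the reasons for reporting
        else:
            accepted.append(rec)
    return accepted, rejected


records = [
    {"id": 1, "region": "north", "age": 34},
    {"id": 2, "region": "mars", "age": 29},
    {"id": 3, "region": "south", "age": -5},
]
ok, bad = validate(records, allowed_regions={"north", "south"})
print([r["id"] for r in ok])               # [1]
print([(r["id"], why) for r, why in bad])
# [(2, ['unknown region']), (3, ['age out of range'])]
```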


2. Transform 


In the data transformation stage, a series of rules or functions are applied to the extracted
data in order to prepare it for loading into the end target. Some data does not require any
transformation at all; such data is known as "direct move" or "pass through" data.

An important function of transformation is the cleaning of data, which aims to pass only
"proper" data to the target. The challenge when different systems interact lies in how the
relevant systems interface and communicate. Character sets that may be available in one system
may not be so in others.

In other cases, one or more of the following transformation types may be required to 
meet the business and technical needs of the server or data warehouse: 

* Selecting only certain columns to load (or selecting null columns not to load). For
example, if the source data has three columns (aka "attributes"), roll_no, age, and
salary, then the selection may take only roll_no and salary. Or, the selection
mechanism may ignore all those records where salary is not present (salary = null).
* Translating coded values (e.g., if the source system codes male as "1" and female as
"2", but the warehouse codes male as "M" and female as "F")
* Encoding free-form values (e.g., mapping "Male" to "M")
* Deriving a new calculated value (e.g., sale_amount = qty * unit_price)
* Sorting or ordering the data based on a list of columns to improve search performance
* Joining data from multiple sources (e.g., lookup, merge) and deduplicating the data
* Aggregating (for example, rollup: summarizing multiple rows of data, such as total sales
for each store and for each region)
* Generating surrogate-key values
* Transposing or pivoting (turning multiple columns into multiple rows or vice versa)
* Splitting a column into multiple columns (e.g., converting a comma-separated list,
specified as a string in one column, into individual values in different columns)
* Disaggregating repeating columns
* Looking up and validating the relevant data from tables or referential files
* Applying any form of data validation; failed validation may result in a full rejection of
the data, partial rejection, or no rejection at all, and thus none, some, or all of the data
is handed over to the next step depending on the rule design and exception handling;
many of the above transformations may result in exceptions, e.g., when a code
translation parses an unknown code in the extracted data
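
Several of the transformation types listed above can be combined in a short Python sketch. The
column names, the code table `GENDER_CODES`, and the sample rows are all invented for
illustration; a real ETL tool would express the same steps through its own mapping language or
SQL. Note that an unknown gender code would raise a `KeyError` here, which is the kind of
exception the last list item refers to.

```python
from collections import defaultdict

GENDER_CODES = {"1": "M", "2": "F"}   # source code -> warehouse code


def transform(rows):
    """Apply several of the transformation types listed above, row by row."""
    seen, out = set(), []
    for row in rows:
        if row["qty"] is None:               # skip records with a null value
            continue
        if row["order_id"] in seen:          # deduplicate on the order id
            continue
        seen.add(row["order_id"])
        last, first = (part.strip() for part in row["customer"].split(","))
        out.append({                         # select only the columns we need
            "order_id": row["order_id"],
            "first_name": first,             # one column split into two
            "last_name": last,
            "gender": GENDER_CODES[row["gender"]],          # translate coded value
            "sale_amount": row["qty"] * row["unit_price"],  # derived value
            "store": row["store"],
        })
    return out


def total_sales_by_store(rows):
    """Aggregate (roll up) the sale amounts for each store."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["store"]] += row["sale_amount"]
    return dict(totals)


source = [
    {"order_id": 1, "customer": "Lee, Ann", "gender": "2", "qty": 2,
     "unit_price": 5, "store": "S1", "note": "first copy"},
    {"order_id": 1, "customer": "Lee, Ann", "gender": "2", "qty": 2,
     "unit_price": 5, "store": "S1", "note": "duplicate"},
    {"order_id": 2, "customer": "Kim, Bo", "gender": "1", "qty": None,
     "unit_price": 9, "store": "S1", "note": ""},
    {"order_id": 3, "customer": "Roy, Cy", "gender": "1", "qty": 3,
     "unit_price": 4, "store": "S2", "note": ""},
]
clean = transform(source)
print(total_sales_by_store(clean))  # {'S1': 10, 'S2': 12}
```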


3. Load 


The load phase loads the data into the end target, which may be a simple delimited flat 
file or a data warehouse. Depending on the requirements of the organization, this process 
varies widely. Some data warehouses may overwrite existing information with cumulative 
information; updating extracted data is frequently done on a daily, weekly, or monthly basis. 
Other data warehouses (or even other parts of the same data warehouse) may add new data in 
a historical form at regular intervals—for example, hourly. To understand this, consider a data 
warehouse that is required to maintain sales records of the last year. This data warehouse 
overwrites any data older than a year with newer data. However, the entry of data for any one 
year window is made in a historical manner. The timing and scope to replace or append are 
strategic design choices dependent on the time available and the business needs. More 
complex systems can maintain a history and audit trail of all changes to the data loaded in the 
data warehouse. 
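
The one-year sales window described above can be sketched with Python's built-in `sqlite3`
module. The table name `sales`, its columns, and the helper `load_rolling_year` are
assumptions made for this example; dates are stored as ISO-8601 text so that string
comparison matches chronological order.

```python
import sqlite3
from datetime import date, timedelta


def load_rolling_year(conn, new_rows, today):
    """Append new sales rows, then drop rows older than one year before `today`."""
    cutoff = (today - timedelta(days=365)).isoformat()
    with conn:  # commit on success, roll back on error
        conn.executemany(
            "INSERT INTO sales(sale_date, amount) VALUES (?, ?)", new_rows
        )
        conn.execute("DELETE FROM sales WHERE sale_date < ?", (cutoff,))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, amount INTEGER)")
conn.execute("INSERT INTO sales VALUES ('2022-01-15', 100)")   # already stale
load_rolling_year(
    conn, [("2023-06-01", 250), ("2023-06-02", 300)], today=date(2023, 6, 2)
)
rows = conn.execute(
    "SELECT sale_date, amount FROM sales ORDER BY sale_date"
).fetchall()
print(rows)  # [('2023-06-01', 250), ('2023-06-02', 300)]
```

New rows are appended in historical order, while the delete step enforces the window, so the
table never holds more than a year of data.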

As the load phase interacts with a database, the constraints defined in the database 
schema—as well as in triggers activated upon data load—apply (for example, 
uniqueness, referential integrity, mandatory fields), which also contribute to the overall data 
quality performance of the ETL process. 
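
How schema constraints act as a last line of defense during the load can be sketched with
`sqlite3`. The tables `dept` and `customer` and the helper `load_with_constraints` are
hypothetical; the point is only that the database itself rejects rows that violate
uniqueness, a mandatory field, or referential integrity.

```python
import sqlite3


def load_with_constraints(conn, rows):
    """Insert rows one by one; return the rows that the schema constraints rejected."""
    rejected = []
    for row in rows:
        try:
            with conn:   # each row in its own small transaction
                conn.execute(
                    "INSERT INTO customer(id, name, dept_id) VALUES (?, ?, ?)", row
                )
        except sqlite3.IntegrityError as exc:
            rejected.append((row, str(exc)))
    return rejected


conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces foreign keys only when enabled
conn.execute("CREATE TABLE dept (id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE customer ("
    " id INTEGER PRIMARY KEY,"                 # uniqueness
    " name TEXT NOT NULL,"                     # mandatory field
    " dept_id INTEGER REFERENCES dept(id))"    # referential integrity
)
conn.execute("INSERT INTO dept VALUES (10)")
conn.commit()

bad = load_with_constraints(conn, [
    (1, "Ann", 10),    # accepted
    (1, "Bob", 10),    # duplicate primary key
    (2, None, 10),     # missing mandatory name
    (3, "Cy", 99),     # department 99 does not exist
])
print(len(bad))  # 3
```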

* For example, a financial institution might have information on a customer in several 
departments and each department might have that customer's information listed in a 
different way. The membership department might list the customer by name, whereas 
the accounting department might list the customer by number. ETL can bundle all of 
these data elements and consolidate them into a uniform presentation, such as for 
storing in a database or data warehouse. 

* Another way that companies use ETL is to move information to another application
permanently. For instance, the new application might use another database vendor and
most likely a very different database schema. ETL can be used to transform the data
into a format suitable for the new application to use.

* An example would be an Expense and Cost Recovery System (ECRS) such as used 
by accountancies, consultancies, and legal firms. The data usually ends up in the time 
and billing system, although some businesses may also utilize the raw data for 
employee productivity reports to Human Resources (personnel dept.) or equipment 
usage reports to Facilities Management. 


4. Real-life ETL cycle 


The typical real-life ETL cycle consists of the following execution steps: 

(1) Cycle initiation 

(2) Build reference data 

(3) Extract (from sources) 

(4) Validate 

(5) Transform (clean, apply business rules, check for data integrity, create aggregates or 
disaggregates) 

(6) Stage (load into staging tables, if used) 

(7) Audit reports (for example, on compliance with business rules. Also, in case of 
failure, helps to diagnose/repair) 

(8) Publish (to target tables) 

(9) Archive 


5. Challenges 


ETL processes can involve considerable complexity, and significant operational 
problems can occur with improperly designed ETL systems. 

The range of data values or data quality in an operational system may exceed the 
expectations of designers at the time validation and transformation rules are specified. Data 
profiling of a source during data analysis can identify the data conditions that must be 
managed by transform rules specifications, leading to an amendment of validation rules 
explicitly and implicitly implemented in the ETL process. 

Data warehouses are typically assembled from a variety of data sources with different 
formats and purposes. As such, ETL is a key process to bring all the data together in a 
standard, homogeneous environment. 

Design analysis should establish the scalability of an ETL system across the lifetime of
its usage, including understanding the volumes of data that must be processed
within service level agreements. The time available to extract from source systems may
change, which may mean the same amount of data may have to be processed in less time.
Some ETL systems have to scale to process terabytes of data to update data warehouses with
tens of terabytes of data. Increasing volumes of data may require designs that can scale from
daily batch to multiple-day micro batch to integration with message queues or real-time
change data capture for continuous transformation and update.


6. Performance 


ETL vendors benchmark their record-systems at multiple TB (terabytes) per hour (or about
1 GB per second) using powerful servers with multiple CPUs, multiple hard drives, multiple
gigabit-network connections, and lots of memory.

In real life, the slowest part of an ETL process usually occurs in the database load phase. 
Databases may perform slowly because they have to take care of concurrency, integrity 
maintenance, and indices. Thus, for better performance, it may make sense to employ: 

* Direct Path Extract method or bulk unload whenever is possible (instead of querying 

the database) to reduce the load on source system while getting high speed extract; 

* Most of the transformation processing outside of the database; 

* Bulk load operations whenever possible. 

Still, even using bulk operations, database access is usually the bottleneck in the ETL 
process. Some common methods used to increase performance are: 

* Partition tables (and indices): try to keep partitions similar in size (watch for null values
that can skew the partitioning);

* Do all validation in the ETL layer before the load: disable integrity checking (disable 
constraint...) in the target database tables during the load; 

* Disable triggers (disable trigger..) in the target database tables during the load: 
simulate their effect as a separate step; 

* Generate IDs in the ETL layer (not in the database); 

* Drop the indices (on a table or partition) before the load and recreate them after the 
load (SQL: drop index...; create index...); 

* Use parallel bulk load when possible—works well when the table is partitioned or 
there are no indices (Note: attempt to do parallel loads into the same table (partition) 
usually causes locks—if not on the data rows, then on indices); 

* If a requirement exists to do insertions, updates, or deletions, find out which rows 
should be processed in which way in the ETL layer, and then process these three 
operations in the database separately; you often can do bulk load for inserts, but
updates and deletes commonly go through an API (using SQL).
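
The last item in the list above, deciding in the ETL layer which rows are inserts, updates, or deletes, can be sketched as follows. This is only a minimal illustration: the function name `classify_changes`, the `id` key, and the sample rows are assumptions made for the example, not part of any particular ETL tool.

```python
def classify_changes(source_rows, target_rows, key="id"):
    """Split incoming rows into inserts, updates, and deletes by comparing keys."""
    source_by_key = {row[key]: row for row in source_rows}
    target_by_key = {row[key]: row for row in target_rows}

    # Rows whose key is not yet in the target: candidates for a bulk load.
    inserts = [row for k, row in source_by_key.items() if k not in target_by_key]
    # Rows present on both sides but with different content: handled by SQL updates.
    updates = [row for k, row in source_by_key.items()
               if k in target_by_key and row != target_by_key[k]]
    # Rows that disappeared from the source: handled by SQL deletes.
    deletes = [row for k, row in target_by_key.items() if k not in source_by_key]
    return inserts, updates, deletes


# Hypothetical sample data for illustration only.
source = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob B."}, {"id": 4, "name": "Dee"}]
target = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}, {"id": 3, "name": "Cy"}]
inserts, updates, deletes = classify_changes(source, target)
```

Each of the three resulting lists can then be sent to the database as a separate operation.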

Whether to do certain operations in the database or outside may involve a trade-off. For
example, removing duplicates using distinct may be slow in the database; thus, it makes sense
to do it outside. On the other hand, if using distinct significantly (by a factor of 100) decreases
the number of rows to be extracted, then it makes sense to remove duplicates as early as
possible in the database before unloading data.
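
As a rough illustration of removing duplicates outside the database, the sketch below filters already-extracted rows in the ETL layer. The helper name `drop_duplicates` and the tuple-shaped rows are assumptions for the example.

```python
def drop_duplicates(rows):
    """Return rows with exact duplicates removed, keeping first occurrences in order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:  # rows must be hashable, e.g. tuples
            seen.add(row)
            unique.append(row)
    return unique


# Hypothetical extracted rows for illustration only.
rows = [("a", 1), ("b", 2), ("a", 1), ("c", 3), ("b", 2)]
unique_rows = drop_duplicates(rows)
```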

A common source of problems in ETL is a large number of dependencies among ETL jobs.
For example, job "B" cannot start while job "A" is not finished. One can usually achieve better
performance by visualizing all processes on a graph, and trying to reduce the graph, making
maximum use of parallelism, and making "chains" of consecutive processing as short as
possible. Again, partitioning of big tables and their indices can really help.
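
One way to picture the job graph described above is to group jobs into stages, where every job in a stage can run in parallel. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later). The job names and the helper `parallel_stages` are hypothetical.

```python
from graphlib import TopologicalSorter


def parallel_stages(dependencies):
    """Group jobs into stages; all jobs within one stage can run concurrently."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # jobs whose prerequisites are all done
        stages.append(ready)
        sorter.done(*ready)
    return stages


# Each job maps to the set of jobs it must wait for (hypothetical example).
jobs = {"B": {"A"}, "C": {"A"}, "D": {"B", "C"}, "E": set()}
stages = parallel_stages(jobs)
```

The number of stages is the length of the longest chain of consecutive processing, so shortening that chain reduces the number of sequential steps.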

Another common issue occurs when the data are spread among several databases, and 
processing is done in those databases sequentially. Sometimes database replication may be 
involved as a method of copying data between databases, but it can significantly slow down 
the whole process. The common solution is to reduce the processing graph to only three 
layers: 

* Sources 

* Central ETL layer 

* Targets

This approach allows processing to take maximum advantage of parallelism. For 
example, if you need to load data into two databases, you can run the loads in parallel (instead 
of loading into the first, and then replicating into the second). 
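
The idea of loading two targets in parallel, rather than loading one and replicating into the other, might look like the following sketch. In-memory SQLite databases stand in for real target databases, and the names `load_target`, `warehouse_a`, and `warehouse_b` are assumptions for illustration.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def load_target(name, rows):
    """Bulk-load rows into one target and return (name, number of rows loaded)."""
    conn = sqlite3.connect(":memory:")  # stand-in for a real target database
    try:
        conn.execute("CREATE TABLE facts (id INTEGER PRIMARY KEY, amount REAL)")
        conn.executemany("INSERT INTO facts VALUES (?, ?)", rows)
        conn.commit()
        count = conn.execute("SELECT COUNT(*) FROM facts").fetchone()[0]
    finally:
        conn.close()
    return name, count


rows = [(1, 9.5), (2, 12.0), (3, 7.25)]
targets = ["warehouse_a", "warehouse_b"]

# Both loads are submitted at once instead of one after the other.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = dict(pool.map(lambda name: load_target(name, rows), targets))
```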

Sometimes processing must take place sequentially. For example, dimensional (reference) 
data are needed before one can get and validate the rows for main “fact” tables. 


New Words


load [ləud] vt. 装载，加载，装填
    n. 装载量，工作量，负载，加载
homogeneous [ˌhɔməu'dʒi:niəs] adj. 同类的，相似的
heterogeneous [ˌhetərəu'dʒi:niəs] adj. 不同种类的，异类的
store [stɔ:] vt. 存储，保管
format ['fɔ:mæt] n. 形式，格式
parallel ['pærəlel] adj. 平行的，并行的
    v. 并行，平行
integrate ['intigreit] vt. 集成，使成整体，使一体化
represent [ˌrepri'zent] vt. 表现，扮演
subsequent ['sʌbsikwənt] adj. 后来的，并发的
combine [kəm'bain] v. （使）联合，（使）结合
streaming ['stri:miŋ] n. 流
intermediate [ˌintə'mi:diət] adj. 中间的
intrinsic [in'trinsik] adj. （指价值、性质）固有的，内在的，本质的
validation [ˌvæli'deiʃən] n. 确认，验证
default [di'fɔ:lt] n. 默认（值），缺省（值）
reject [ri'dʒekt] vt. 拒绝，抵制，丢弃
rectify ['rektifai] vt. 矫正，调整
rule [ru:l] n. 规则，惯例，准则，标准
    v. 规定，管理，支配
interact [ˌintər'ækt] vi. 互相作用，互相影响
column ['kɔləm] n. 列，栏
mechanism ['mekənizəm] n. 机制
ignore [ig'nɔ:] vt. 不理睬，忽视
encode [in'kəud] vt. 编码
sort [sɔ:t] vt. 排序，分类
order ['ɔ:də] vt. 排序
join [dʒɔin] vt. 连接
lookup ['lukʌp] v. 查找
duplicate ['dju:plikeit] vt. 复制，重复
pivot ['pivət] vt. 转置
disaggregate [dis'ægrigeit] v. 去除，分解
rejection [ri'dʒekʃən] n. 拒绝
handling ['hændliŋ] n. 处理
    adj. 操作的
delimit [di'limit] vt. 定界限，划界
overwrite [ˌəuvə'rait] v. 覆盖
cumulative ['kju:mjulətiv] adj. 累积的
trigger ['trigə] vt. 引发，引起，触发
uniqueness [ju:'ni:knis] n. 唯一性，单值性，独特性
integrity [in'tegriti] n. 完整性
permanently ['pə:mənəntli] adv. 永存地，不变地
cycle ['saikl] n. 周期，循环
    vi. 循环
validate ['vælideit] vt. 确认，证实，验证
compliance [kəm'plaiəns] n. 合规，依从
diagnose ['daiəgnəuz] v. 诊断
repair [ri'peə] n. 修理，修补
    vt. 修理，修补，补救，纠正
considerable [kən'sidərəbl] adj. 相当大（或多）的，相当可观的
improperly [im'prɔpəli] adv. 不正确地，不适当地
designer [di'zainə] n. 设计者
specify ['spesifai] vt. 指定
condition [kən'diʃən] n. 条件，情形，环境
    vt. 以……为条件，使达到要求的情况
specification [ˌspesifi'keiʃən] n. 规范，规格，说明书
amendment [ə'mendmənt] n. 改善，改正
scalability [ˌskeilə'biliti] n. 可量测性
agreement [ə'gri:mənt] n. 协定，协议
continuous [kən'tinjuəs] adj. 连续的，持续的
bulk [bʌlk] n. 大批，大多数；散装
    vt. 显得大，显得重要
unload [ˌʌn'ləud] v. 卸载
bottleneck ['bɔtlnek] n. 瓶颈
disable [dis'eibl] v. 使无效，使失去能力
insertion [in'sə:ʃən] n. 插入
parallelism ['pærəlelizəm] n. 平行，并行
replication [ˌrepli'keiʃən] n. 复制
dimensional [di'menʃənəl] adj. 维的，空间的


Phrases


refer to 指的是；涉及；适用于
data warehousing 数据入库，数据存入
data extraction 数据提取
data source 数据源
data transformation 数据变换，数据转换
data loading 数据载入，数据装载
operational data store 操作型数据存储
data mart 数据集市，数据市场
take time 费时
original data 原始数据
data organization 数据结构，数据组织
relational database 关系数据库
flat file 平面文件
non-relational database 非关系型数据库
web spidering 网页爬取，网络爬虫
screen scraping 屏幕抓取
on-the-fly 在不停机状态下，即时
convert ... into ... 把……转换为……
appropriate for 适于，合乎
data validation 数据有效性
a series of 一连串的，一系列的
be applied to 适用于，应用于，施加于
character set 字符集
calculated value 计算值
look up 查找
hand over 移交
audit trail 审计跟踪
mandatory field 必备字段，必填字段
data element 数据元素
be suitable for ... 适合……的
reference data 参考数据
data integrity 数据完整性
staging table 临时表
transformation rule 转换规则
data profiling 数据剖析
message queue 消息队列
change data capture 变更数据捕获
partition table 分区表
trade off 权衡
consecutive processing 串行处理，顺序处理，连续处理

Abbreviations


ETL (Extract, Transform, Load) 抽取、转换、加载
XML (eXtensible Markup Language) 可扩展标记语言
VSAM (Virtual Storage Access Method) 虚拟存储存取方法，虚拟存储访问方法
ISAM (Indexed Sequential Access Method) 索引顺序存取方法，索引顺序访问方法
ECRS (Expense and Cost Recovery System) 费用与成本回收系统
API (Application Programming Interface) 应用程序编程接口


Exercises


[Ex. 5] Answer the following questions according to the text.

1. What does ETL stand for? What does it refer to in computing?

2. What do ETL systems commonly do?

3. What does the first part of an ETL process involve? 

4. Why does extracting data represent the most important aspect of ETL in many cases? 
5. What does the extraction phase aim to do in general? 

6. What happens if the data fails the validation rules? 

7. What is an important function of transformation? 

8. What does the load phase do? 

9. Where does the slowest part of an ETL process usually occur in real life? Why? 
10. What is a common source of problems in ETL? 


参考 译文 


操作 系统 


1. 什么 是 操作 系统 


操作 系统 是 计算 机 的 核心 软件 。 它 执行 许多 功能 ， 用 很 基本 的 术语 说 ， 它 是 计算 机 
与 外 部 设备 之 间 的 接口 。 在 硬件 一 节 中 ， 计 算 机 被 描述 为 由 许多 独立 的 部 件 组 成 ， 包 括 
显示 器 、 键 盘 、 鼠 标 及 其 他 部 件 。 操 作 系统 使 用 所 谓 的 “驱动 程序 ”给 这 些 部 件 提供 接 
口 。 这 就 是 为 什么 当 你 安装 一 个 新 打印 机 或 其 他 硬件 时 ， 系 统 会 问 你 是 否 进一步 安装 称 
作 驱 动 程序 的 软件 。 


2. 驱动 程序 做 什么 


驱动 程序 是 一 个 经 过 编写 的 特殊 程序 ， 它 了 解 与 其 接口 的 设备 〈 如 打印 机 、 显 卡 、 
声卡 或 光盘 驱动 器 ) 的 操作 。 它 把 来 自 操作 系统 或 用 户 的 命令 翻译 为 其 接口 的 设备 可 以 
理解 的 命令 。 它 也 把 来 自 这 些 部 件 的 响应 翻译 为 操作 系统 、 应 用 软件 或 用 户 可 以 理解 的 
响应 。


3. 其 他 操作 系统 的 功能


操作 系统 的 其 他 功能 包括 : 

e 用 于 监控 计算 机 执行 、 排 除 故障 或 维护 系统 部 件 的 系统 工具 。 

o 程序 用 来 执行 特殊 任务 、 特别 是 与 计算 机 系统 部 件 接口 相关 任务 的 一 系列 功能 库 
或 函数 。 

操作 系统 使 这 些 接口 功能 及 其 他 功能 平稳 运行 ， 并 且 这 些 功 能 对 用 户 通常 是 透明 的 。 


4. 操作 系统 关注 的 事项 


如 上 所 述 ， 操 作 系统 是 计算 机 程序 。 操 作 系统 由 可 能 出 错 的 程序 员 编写 。 因 此 ， 即 
使 在 发 布 前 进行 了 测试 ， 它 仍 可 能 有 一 些 错误 代码 。 有 些 公司 的 软件 质量 控制 和 测试 优 
于 其 他 公司 ， 所 以 也 许 是 注意 到 了 不 同 的 操作 系统 质量 不 同 。 操 作 系统 的 错误 引起 以 下 
3 类 主要 问题 : 

。 系统 崩溃 和 不 稳定 性 这 通常 由 操作 系统 中 的 软件 错误 引起 ， 尽 管 运行 在 操 
作 系 统 上 的 计算 机 程序 可 以 使 系统 更 不 稳定 ， 甚 至 由 它们 引起 系统 崩溃 。 这 些 变 
化 取决 于 操作 系统 的 类 型 。 系 统 崩 溃 是 系统 冻结 并 且 没 有 反应 的 行为 , 用 户 必须 
重新 启动 。 

e 安全 漏洞 某 些 软 件 错误 为 未 经 授权 的 入 侵 者 打开 进入 系统 的 大 门 。 由 于 这 
些 漏洞 ， 未 经 授权 的 入 侵 者 也 许 试图 使 用 它们 非法 访问 你 的 系统 。 给 这 些 漏洞 打 
补丁 通常 可 以 使 计算 机 系统 变 得 安全 。 

e 功能 失常 有 时 操作 系统 的 错误 可 以 引起 计算 机 及 一 些 外 设 (如 打印 机 ) 不 能 正 
常 工作 。 


5. 操作 系统 的 类 型


让 我 们 来 看 看 不 同类 型 的 操作 系统 并 了 解 它们 之 问 的 区 别 。 

。 实时 操作 系统 

它 是 一 个 多 任务 操作 系统 ， 其 目的 是 执行 实时 应 用 。 实 时 操作 系统 通常 使 用 专门 的 
调度 算法 ， 以 便 可 以 实现 确定 性 的 行为 。 实 时 操作 系统 的 主要 目的 是 对 事件 做 出 快速 和 
可 预测 的 响应 。 其 设计 或 者 是 事件 驱动 的 或 者 是 分 时 的 。 事 件 驱 动 的 系统 基于 优先 级 在 
任务 之 间 切 换 ， 而 分 时 操作 系统 基于 时 钟 中 断 在 任务 之 间 切 换 。 

e 多 用 户 和 单 用 户 的 操作 系统 

多 用 户 计算 机 操作 系统 允许 多 个 用 户 同时 访问 计算 机 系统 。 分 时 系统 可 以 划分 为 多 
用 户 系统 和 单 用 户 操作 系统 。 多 用 户 操作 系统 通过 分 时 让 多 个 用 户 访问 一 个 计算 机 。 与 
其 相对 应 的 单 用 户 操作 系统 一 次 只 有 一 个 用 户 使 用 。Windows 操作 系统 可 能 有 多 个 账户 
但 并 不 是 一 个 多 用 户 系统 。 只 有 网 络 管理 员 是 真正 的 用 户 。 但 对 于 一 个 类 UNIX 操作 系
统 ， 它 可 以 让 两 个 用 户 在 同一 时 间 登 录 ， 这 种 性 能 使 其 成 为 多 用 户 操作 系统 。

e 多 任务 处 理 及 单 任务 操作 系统 

当 一 次 只 能 运行 一 个 程序 时 ， 该 系统 仍 归于 单 任务 系统 。 在 操作 系统 允许 同时 执行 
多 任务 的 情况 下 ， 它 就 归于 多 任务 操作 系统 。 多 任务 处 理 可 以 分 为 两 类 : 抢先 型 与 协作 
型 。 在 抢先 多 任务 操作 系统 中 ， 系 统 为 每 个 程序 切 分 一 段 CPU 时间 片 。 类 UNIX 的 操作
系统 (如 Solaris 和 Linux) 支持 抢先 式 多 任务 。 如 果 你 了 解 多 线程 ， 那 么 可 以 把 这 种 类 
型 的 多 任务 处 理 能 力 当 作 交 错 的 多 线程 。 协作 多 任务 通过 每 一 个 进程 以 确定 的 方式 给 其 
他 进程 分 配 时 间 来 实现 。 这 种 多 任务 类 似 于 块 多 线程 的 想法 ， 在 一 个 线程 中 运行 ， 直 到 
它 被 另外 一 些 事件 闭锁 。 

。 分 布 式 操作 系统 

管理 一 组 独立 的 计算 机 并 使 得 它们 看 起 来 是 一 台 计 算 机 ， 这 就 被 称 为 分 布 式 操作 系 
统 。 可 以 链接 并 彼此 通信 的 计算 机 的 发 展 带 来 了 分 布 式 计算 。 分 布 式 计算 在 多 个 计算 机 
上 进行 。 当 一 组 计算 机 协作 工作 时 就 构成 一 个 分 布 式 系统 。 

e 嵌入 式 操作 系统 

嵌入 式 操作 系统 设计 用 于 嵌入 式 计算 机 系统 。 它 们 为 像 PDA. 这 类 更 少 自治 的 小 型 
机 而 设计 。 它 们 能 够 在 资源 有 限 的 系统 中 运行 。 它 们 非常 紧凑 ， 设 计 效率 非常 高 。 

e 移动 操作 系统 

移动 操作 系统 虽然 在 功能 上 与 其 他 操作 系统 并 没有 明显 的 不 同 ,但 绝对 是 操作 系统 
类 型 列表 中 的 重要 一 项 。 移动 操作 系统 控制 移动 设备 , 其 设计 支持 无 线 通 信和 移动 应 用 。 
它 内 置 支持 移动 多 媒体 格式 。 平 板 电脑 和 智能 手机 都 运行 在 移动 操作 系统 上 。 

e 批 处 理 和 交互 系统 

批 处 理 指 的 是 “ 按 批 ”执行 计算 机 程序 ， 无 需 人 工 干 预 。 在 批 处 理 系 统 中 ， 收 集 程 
序 、 分 组 并 在 稍 后 的 日 期 处 理 。 并 不 提示 用 户 输入 数据 ， 因 为 以 后 要 处 理 的 数据 已 经 提 
前 收集 了 。 因 为 输入 数据 分 批 收集 和 处 理 故 名 批 处 理 。 IBM 的 z/OS 具有 批 处 理 能 力 。 
与 此 相对 ， 交 互 式 的 操作 需要 用 户 干预 。 用 户 不 在 就 不 能 执行 。 

。 在 线 和 离线 处 理 系统 

在 线 数据 处 理 时 ， 用 户 保持 与 计算 机 的 联系 并 在 计算 机 中 央 处 理 单元 的 控制 下 执行 。 
当 进 程 不 在 CPU 的 直接 控制 下 执行 时 , 该 处 理 被 称 为 离线 。 让 我 们 以 批 处 理 为 例 介 绍 。 
这 里 ， 数 据 的 分 批 或 分 组 可 以 无 须 用 户 和 CPU 的 干预 ， 它 可 以 离线 完成 。 但 实际 执行 
过 程 中 可 以 在 处 理 器 直接 控制 下 完成 ， 也 就 是 在 线 完成 。 


Text A 


R Programming Language 


R is a programming language and software environment for statistical computing and 
graphics supported by the R Foundation for Statistical Computing. The R language is widely 
used among statisticians and data miners for developing statistical software and data analysis. 
Polls, surveys of data miners, and studies of scholarly literature databases show that R’s 
popularity has increased substantially in recent years. 

R is a GNU project. The source code for the R software environment is written primarily 
in C, Fortran, and R. R is freely available under the GNU General Public License, and 
precompiled binary versions are provided for various operating systems. While R has a 
command line interface, there are several graphical front-ends available. 


1. History 


R is an implementation of the S programming language combined with lexical scoping 
semantics inspired by Scheme. S was created by John Chambers while at Bell Labs. There are 
some important differences, but much of the code written for S runs unaltered. 

R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New 
Zealand, and is currently developed by the R Development Core Team, of which Chambers is 
a member. The project was conceived in 1992, with an initial version released in 1994 and a 
stable beta version in 2000.


2. Statistical features


R and its libraries implement a wide variety of statistical and graphical techniques, 
including linear and nonlinear modeling, classical statistical tests, time-series analysis, 
classification, clustering, and others. R is easily extensible through functions and extensions, 
and the R community is noted for its active contributions in terms of packages. Many of R’s 
standard functions are written in R itself, which makes it easy for users to follow the 
algorithmic choices made. For computationally intensive tasks, C, C++, and Fortran code can 
be linked and called at run time. Advanced users can write C, C++, Java, .NET or Python 
code to manipulate R objects directly. R is highly extensible through the use of 
user-submitted packages for specific functions or specific areas of study. Due to its S heritage, 
R has stronger object-oriented programming facilities than most statistical computing 
languages. Extending R is also eased by its lexical scoping rules. 

Another strength of R is static graphics, which can produce publication-quality graphs, 
including mathematical symbols. Dynamic and interactive graphics are available through 
additional packages. 

R has Rd, its own LaTeX-like documentation format, which is used to supply comprehensive 
documentation, both on-line in a number of formats and in hard copy. 


3. Programming features 


R is an interpreted language, and users typically access it through a command-line 
interpreter. If a user types 2+2 at the R command prompt and presses enter, the computer 
replies with 4. 

Like other similar languages such as APL and MATLAB, R supports matrix arithmetic. 
R's data structures include vectors, matrices, arrays, data frames (similar to tables in a 
relational database) and lists. R’s extensible object system includes objects for (among others): 
regression models, time-series and geo-spatial coordinates. The scalar data type was never a 
data structure of R. Instead, a scalar is represented as a vector with length one. 

R supports procedural programming with functions and, for some functions, 
object-oriented programming with generic functions. A generic function acts differently 
depending on the classes of arguments passed to it. In other words, the generic function 
dispatches the function (method) specific to that class of object. For example, R has a generic 
print function that can print almost every class of object in R with a simple print(objectname) 
syntax. 
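
The dispatch idea behind generic functions can be illustrated by analogy in Python, whose standard library offers `functools.singledispatch`. This is not R code and is not how R implements its generics; it is only a sketch of one function name selecting a method by the class of its argument. The function `describe` and its messages are made up for the example.

```python
from functools import singledispatch


@singledispatch
def describe(obj):
    """Fallback method, used when no class-specific method is registered."""
    return f"object: {obj!r}"


@describe.register
def _(obj: list):
    # Method chosen when the argument is a list.
    return f"list of {len(obj)} items"


@describe.register
def _(obj: float):
    # Method chosen when the argument is a float.
    return f"number {obj:.2f}"


list_text = describe([1, 2, 3])
float_text = describe(2.5)
other_text = describe("x")
```

The caller always writes `describe(something)`, much as an R user writes `print(objectname)`, and the method that runs depends on the class of the argument.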

Although used mainly by statisticians and other practitioners requiring an environment 
for statistical computation and software development, R can also operate as a general matrix
calculation toolbox, with performance benchmarks comparable to GNU Octave or
MATLAB. 


4. Packages 


The capabilities of R are extended through user-created packages, which include 
specialized statistical techniques, graphical devices (such as the ggplot2 package developed 
by Hadley Wickham), import/export capabilities, reporting tools (knitr, Sweave), etc. These 
packages are developed primarily in R, and sometimes in Java, C, C++, and Fortran. 

A core set of packages is included with the installation of R, with more than 7,801 
additional packages (as of January 2016) available at the Comprehensive R Archive Network 
(CRAN), Bioconductor, Omegahat, GitHub, and other repositories. 

The “Task Views” page (subject list) on the CRAN website lists a wide range of tasks (in 
fields such as Finance, Genetics, High Performance Computing, Machine Learning, Medical 
Imaging, Social Sciences and Spatial Statistics) to which R has been applied and for which 
packages are available. R has also been identified by the FDA as suitable for interpreting data 
from clinical research. 

Other R package resources include Crantastic, a community site for rating and reviewing 
all CRAN packages, and R-Forge, a central platform for the collaborative development of R 
packages, R-related software, and projects. R-Forge also hosts many unpublished beta 
packages, and development versions of CRAN packages. 


5. Interfaces 


5.1 Graphical user interfaces 


* Architect: cross-platform open source IDE for data science based on Eclipse and
StatET.

* DataJoy: online R editor focused on beginners to data science and collaboration.

* Deducer: GUI for menu-driven data analysis (similar to SPSS/JMP/Minitab).

* Java GUI for R: cross-platform stand-alone R terminal and editor based on Java (also
known as JGR).

* Number Analytics: GUI for R-based business analytics (similar to SPSS) working on
the cloud.

* Rattle GUI: cross-platform GUI based on RGtk2 and specifically designed for data
mining.

* R Commander: cross-platform menu-driven GUI based on tcltk (several plug-ins to
Rcmdr are also available).

* Revolution R Productivity Environment (RPE): Visual Studio-based IDE provided by
Revolution Analytics, with plans for a web-based point-and-click interface.

* RGUI: comes with the precompiled version of R for Microsoft Windows.

* RKWard: extensible GUI and IDE for R.

* RStudio: cross-platform open source IDE (which can also be run on a remote Linux
server).


5.2 Editors and IDEs 


Text editors and integrated development environments (IDEs) with some support for R
include: ConTEXT, Eclipse (StatET), Emacs (Emacs Speaks Statistics), LyX (modules for
knitr and Sweave), Vim, jEdit, Kate, RStudio, Sublime Text, TextMate, Atom, WinEdt (R
Package RWinEdt), Tinn-R, Notepad++, and Architect.


5.3 Scripting languages 


R functionality has been made accessible from several scripting languages such as
Python, Perl, Ruby, F# and Julia. Scripting in R itself is possible via a front-end called littler.


6. Comparison with SAS, SPSS, and Stata 


The general consensus is that R compares well with other popular statistical packages,
such as SAS, SPSS, and Stata. In a comparison of the basic features expected of statistical
software, R is on a par with the best of them.

In January 2009, the New York Times ran an article about R gaining acceptance among
data analysts and presenting a threat to the market share occupied by commercial statistical
packages, such as SAS.


New Words


programming ['prəugræmiŋ] n. 编程，程序设计
environment [in'vaiərənmənt] n. 环境
graphics ['græfiks] n. 图形，图形学
support [sə'pɔ:t] vt. & n. 支持，支撑
foundation [faun'deiʃən] n. 基础，基金会
analysis [ə'nælisis] n. 分析；分解
poll [pəul] n. 民意调查；投票选举；投票数
    v. 做民意调查
survey ['sə:vei] n. 调查
scholarly ['skɔləli] adj. 学术的
literature ['litərətʃə] n. 文献
project ['prɔdʒekt] n. 项目，工程
precompiled [ˌpri:kəm'paild] adj. 预编译的
interface ['intəfeis] n. 接口，界面
front-end ['frʌnt end] n. 前端
implementation [ˌimplimen'teiʃən] n. 实现
lexical ['leksikəl] adj. 词汇的
package ['pækidʒ] n. 包，软件包
link [liŋk] v. 链接，连接
manipulate [mə'nipjuleit] vt. 操作，处理
object ['ɔbdʒikt] n. 对象
submit [səb'mit] v. 提交
heritage ['heritidʒ] n. 遗产，继承物
facility [fə'siliti] n. 功能，设施
strength [streŋθ] n. 优点，长处
symbol ['simbəl] n. 符号
document ['dɔkjumənt] n. 文档，文件
comprehensive [ˌkɔmpri'hensiv] adj. 全面的，广泛的
interpret [in'tə:prit] v. 解释
access ['ækses] vt. 访问，存取
prompt [prɔmpt] n. 提示符
enter ['entə] v. 输入，进入
arithmetic [ə'riθmətik] n. 算术
vector ['vektə] n. 向量
matrix ['meitriks] n. 矩阵
regression [ri'greʃən] n. 回归
coordinate [kəu'ɔ:dinit] n. 坐标
scalar ['skeilə] n. 标量
dispatch [dis'pætʃ] vt. 分派，调度
practitioner [præk'tiʃənə] n. 从业者
development [di'veləpmənt] n. 开发，发展
toolbox ['tu:lbɔks] n. 工具箱
export [ik'spɔ:t] v. 导出，输出
website ['websait] n. 网站
task [tɑ:sk] n. 任务
identify [ai'dentifai] vt. 识别，鉴别，确定
collaborative [kə'læbəreitiv] adj. 协作的，合作的
unpublished [ˌʌn'pʌbliʃt] adj. 未发布的，未出版的
collaboration [kəˌlæbə'reiʃən] n. 协作
remote [ri'məut] adj. 远程的
server ['sə:və] n. 服务器
comparison [kəm'pærisn] n. 比较
consensus [kən'sensəs] n. 共识，一致意见
acceptance [ək'septəns] n. 接受，赞同，相信


Phrases


programming language 程序设计语言
data miner 数据挖掘者
in recent years 最近几年中
General Public License 通用公共许可证
provide for ... 为……提供
command line 命令行
combine with ... 与……结合
lexical scoping semantics 词汇作用域语义
Bell Labs 贝尔实验室
beta version 测试版
classical statistical test 经典统计测试
time-series analysis 时间序列分析
in terms of 根据，在……方面
run time 运行时
object-oriented programming 面向对象编程
lexical scoping rule 词汇作用域规则
hard copy 硬拷贝
interpreted language 解释语言
command prompt 命令提示符
data frame 数据框
regression model 回归模型
scalar data type 标量数据类型
be represented as ... 被表示为……
generic function 泛型函数
a set of 一组，一套
suitable for ... 适合……的
graphical user interface 图形用户界面
focus on 集中
menu-driven data analysis 菜单驱动的数据分析
special issue 特刊，专号
scripting language 脚本语言
market share 市场份额，市场占有率


Abbreviations


GNU “GNU's Not Unix”的递归缩写
CRAN (Comprehensive R Archive Network) R 综合归档网
FDA (Food and Drug Administration) （美国）食品及药物管理局
IDE (Integrated Development Environment) 集成开发环境


Notes


[1] Many of R's standard functions are written in R itself, which makes it easy for users to
follow the algorithmic choices made.
In this sentence, "which makes it easy for users to follow the algorithmic choices made" is a
non-restrictive attributive clause that adds information to "Many of R's standard functions
are written in R itself". Within that clause, "it" is a formal object; the real object is the
infinitive phrase "to follow the algorithmic choices made".

[2] Another strength of R is static graphics, which can produce publication-quality graphs,
including mathematical symbols.
In this sentence, "which can produce publication-quality graphs, including mathematical
symbols" is a non-restrictive attributive clause that adds information to "static graphics".


[3] Although used mainly by statisticians and other practitioners requiring an environment
for statistical computation and software development, R can also operate as a general
matrix calculation toolbox, with performance benchmarks comparable to GNU
Octave or MATLAB.
In this sentence, "Although used mainly by statisticians and other practitioners requiring an
environment for statistical computation and software development" is a past participle
phrase that functions as an adverbial.

[4] The capabilities of R are extended through user-created packages, which include
specialized statistical techniques, graphical devices (such as the ggplot2 package
developed by Hadley Wickham), import/export capabilities, reporting tools (knitr,
Sweave), etc.
In this sentence, "which include specialized statistical techniques, graphical devices (such as
the ggplot2 package developed by Hadley Wickham), import/export capabilities, reporting
tools (knitr, Sweave), etc." is a non-restrictive attributive clause that adds information to
"user-created packages". "developed by Hadley Wickham" is a past participle phrase that
functions as an attributive modifying "ggplot2 package".

[5] The "Task Views" page (subject list) on the CRAN website lists a wide range of tasks (in
fields such as Finance, Genetics, High Performance Computing, Machine Learning,
Medical Imaging, Social Sciences and Spatial Statistics) to which R has been applied
and for which packages are available.
In this sentence, "on the CRAN website" is a prepositional phrase that functions as an
attributive modifying "The 'Task Views' page". "to which R has been applied" and "for which
packages are available" are two attributive clauses with fronted prepositions; both modify
"a wide range of tasks".

[6] Other R package resources include Crantastic, a community site for rating and reviewing
all CRAN packages, and R-Forge, a central platform for the collaborative development
of R packages, R-related software, and projects.
In this sentence, "a community site for rating and reviewing all CRAN packages" adds
information to "Crantastic", and "a central platform for the collaborative development of R
packages, R-related software, and projects" adds information to "R-Forge".


Exercises


[Ex 1] Answer the following questions according to the text.

1. What is R?

2. For what purposes do statisticians and data miners widely use the R language?

3. In what languages is the source code for the R software environment primarily written?

4. By whom was R created? And where?

5. What are the statistical and graphical techniques R and its libraries implement?

6. Why does R have stronger object-oriented programming facilities than most statistical
computing languages?

7. What do R's data structures include?

8. What does R's extensible object system include?

9. What has R also been identified by the FDA as?

10. What do text editors and integrated development environments (IDEs) with some support
for R include?

[Ex 3] Translate the following passage.

What is R?


1. Introduction to R 


R is a language and environment for statistical computing and graphics. It is a GNU 
project which is similar to the S language and environment which was developed at Bell 
Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. 
R can be considered as a different implementation of S. There are some important differences, 
but much code written for S runs unaltered under R. 

R provides a wide variety of statistical (linear and nonlinear modelling, classical 
statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, 
and is highly extensible. The S language is often the vehicle of choice for research in 
statistical methodology, and R provides an Open Source route to participation in that activity. 

One of R's strengths is the ease with which well-designed publication-quality plots can 
be produced, including mathematical symbols and formulae where needed. 

R is available as Free Software under the terms of the Free Software Foundation's GNU 
General Public License in source code form. It compiles and runs on a wide variety of UNIX 
platforms and similar systems (including FreeBSD and Linux), Windows and MacOS. 


2. The R environment 


R is an integrated suite of software facilities for data manipulation, calculation and
graphical display. It includes:

* An effective data handling and storage facility,

* A suite of operators for calculations on arrays, in particular matrices,

* A large, coherent, integrated collection of intermediate tools for data analysis,

* Graphical facilities for data analysis and display either on-screen or on hardcopy, and

* A well-developed, simple and effective programming language which includes
conditionals, loops, user-defined recursive functions and input and output facilities.

R, like S, is designed around a true computer language, and it allows users to add 
additional functionality by defining new functions. For computationally-intensive tasks, C, 
C++ and Fortran code can be linked and called at run time. Advanced users can write C code 
to manipulate R objects directly. 


[Ex 4] Fill in each blank with an appropriate word from the list (use each word only once).


What Is Python? 


1. What Is Python? 


The Python programming language is freely available and makes solving a computer
problem almost as easy as writing out your thoughts about the solution. The __(1)__ can be
written once and run on almost any computer without needing to change the __(2)__.


2. How Is Python Used?


Python is a general-purpose programming language that can be used on any modern
computer operating system. It can be used for __(3)__ text, numbers, images, scientific data
and just about anything else you might __(4)__ on a computer. It is used daily in the __(5)__ of
the Google search engine, the video-sharing website YouTube, NASA and the New York
Stock Exchange. These are but a few of the places where Python plays important roles in the
success of business, government and non-profit organizations; there are many others.

Python is an __(6)__ language. This means that it is not converted to computer-readable
code before the program is run but at runtime. In the past, this __(7)__ of language was
called a scripting language, intimating its use was for trivial tasks. However, programming
languages such as Python have forced a change in that nomenclature. Increasingly, large
applications are written almost exclusively in Python.


3. How does Python compare to Java? 


Both Python and Java are object-oriented languages with substantial __(8)__ of
pre-written code that can be run on almost any operating system. However, their
implementations are vastly different.

Java is neither an interpreted language nor a compiled language. It is a bit of both. When
compiled, Java programs are compiled to bytecode, a Java-specific type of code. When the
program is run, this bytecode is run through a Java Runtime Environment to convert it to
machine code, which is __(9)__ and executable by the computer. Once compiled to bytecode,
Java programs cannot be modified.

Python programs, on the other hand, are typically __(10)__ at the time of running,
when the Python interpreter reads the program. However, they can be compiled to
computer-readable machine code. Python does not use an intermediary step for platform
independence. Instead, platform independence is in the implementation of the interpreter.


Text B 


Python Programming Language 


Python is a widely used high-level, general-purpose, interpreted, dynamic programming 
language. Its design philosophy emphasizes code readability, and its syntax allows 
programmers to express concepts in fewer lines of code than possible in languages such as 
C++ or Java. The language provides constructs intended to enable clear programs on both a 
small and large scale. 

Python supports multiple programming paradigms, including object-oriented, imperative 
and functional programming or procedural styles. It features a dynamic type system and 
automatic memory management and has a large and comprehensive standard library. 

Python interpreters are available for many operating systems, allowing Python code to 
run on a wide variety of systems. Using third-party tools, such as Py2exe or Pyinstaller, 
Python code can be packaged into stand-alone executable programs for some of the most 
popular operating systems, so Python-based software can be distributed to, and used on, those 
environments with no need to install a Python interpreter.


1. Features and philosophy


Python is a multi-paradigm programming language: object-oriented programming and 
structured programming are fully supported, and many language features support functional 
programming and aspect-oriented programming. Many other paradigms are supported via 
extensions, including design by contract and logic programming. 

Rather than requiring all desired functionality to be built into the language’s core, Python 
was designed to be highly extensible. Python can also be embedded in existing applications 
that need a programmable interface. 


2. Syntax and semantics 


Python is intended to be a highly readable language. It is designed to have an uncluttered 
visual layout, often using English keywords where other languages use punctuation. 
Furthermore, Python has fewer syntactic exceptions and special cases than C or Pascal. 


2.1 Indentation 


Python uses whitespace indentation, rather than curly braces or keywords, to delimit 
blocks; this feature is also termed the off-side rule. An increase in indentation comes after 
certain statements; a decrease in indentation signifies the end of the current block. 
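
The off-side rule described above can be illustrated with a minimal sketch. The function name `classify` is hypothetical and chosen only for this example.

```python
def classify(n):
    # The indented lines below form the body of the function.
    if n < 0:
        # A deeper indentation level opens the block that belongs to `if`.
        label = "negative"
    else:
        label = "non-negative"
    # Returning to the function's indentation level ends the if/else blocks.
    return label


print(classify(-3))
print(classify(7))
```

No braces or `end` keywords appear; the change in indentation alone marks where each block starts and stops.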


2.2 Statements and control flow 


Python’s statements include (among others): 

* The assignment statement (token “=”, the equals sign). This operates differently than in 
traditional imperative programming languages, and this fundamental mechanism 
(including the nature of Python’s version of variables) illuminates many other 
features of the language. Assignment in C, e.g., x = 2, translates to “typed variable 
name x receives a copy of numeric value 2”. The (right-hand) value is copied into an 
allocated storage location for which the (left-hand) variable name is the symbolic 
address. The memory allocated to the variable is large enough (potentially quite large) 
for the declared type. In the simplest case of Python assignment, using the same 
example, x = 2 translates to “(generic) name x receives a reference to a separate, 
dynamically allocated object of numeric (int) type of value 2”. This is termed binding 
the name to the object. Since the name’s storage location doesn’t contain the indicated 
value, it is improper to call it a variable. Names may be subsequently rebound at any 
time to objects of greatly varying types, including strings, procedures, complex 
objects with data and methods, etc. Successive assignments of a common value to 
multiple names, e.g., x = 2; y = 2; z = 2, result in allocating storage to (at most) three 
names and one numeric object, to which all three names are bound. Since a name is a 
generic reference holder, it is unreasonable to associate a fixed data type with it. 
However, at a given time a name will be bound to some object, which will have a type; 
thus there is dynamic typing. 

* The if statement, which conditionally executes a block of code, along with else and elif 
(a contraction of else-if). 

* The for statement, which iterates over an iterable object, capturing each element to a 
local variable for use by the attached block. 

* The while statement, which executes a block of code as long as its condition is true. 

* The try statement, which allows exceptions raised in its attached code block to be 
caught and handled by except clauses; it also ensures that clean-up code in a finally 
block will always be run regardless of how the block exits. 

* The class statement, which executes a block of code and attaches its local namespace 
to a class, for use in object-oriented programming. 

* The def statement, which defines a function or method. 

* The with statement (from Python 2.5), which encloses a code block within a context 
manager (for example, acquiring a lock before the block of code is run and releasing 
the lock afterwards, or opening a file and then closing it), allowing Resource 
Acquisition Is Initialization (RAII)-like behavior. 

* The pass statement, which serves as a NOP. It is syntactically needed to create an 
empty code block. 

* The assert statement, used during debugging to check for conditions that ought to 
apply. 

* The yield statement, which returns a value from a generator function. From Python 2.5, 
yield is also an operator. This form is used to implement coroutines. 

* The import statement, which is used to import modules whose functions or variables 
can be used in the current program. 

* The print statement was changed to the print( ) function in Python 3. 
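
Several of the statements listed above can be seen together in one short sketch. The names used here (`countdown`, `events`, `lock` and so on) are hypothetical and exist only for illustration; the code assumes Python 3.

```python
import threading

# Binding: assignment attaches a name to an object; it does not copy the object.
a = [1, 2, 3]
b = a                     # b is bound to the same list object as a
b.append(4)               # the change is visible through both names
same_object = a is b

x = 2
x = "two"                 # the name x is rebound to an object of another type


def countdown(n):
    """Generator function: yield hands back one value at a time."""
    while n > 0:
        yield n
        n -= 1


events = []
try:
    for value in countdown(3):   # for iterates over the generator
        events.append(value)
    int("not a number")          # raises ValueError
except ValueError:
    events.append("handled")
finally:
    events.append("cleanup")     # runs however the try block exits

lock = threading.Lock()
with lock:                       # acquired here, released when the block ends
    held_inside = lock.locked()
held_after = lock.locked()

print(same_object, a, x, events, held_inside, held_after)
```

The `with` block shows the RAII-like behavior mentioned in the list: the lock is released automatically, even if the block were to raise an exception.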


2.3 Expressions 


Some Python expressions are similar to languages such as C and Java, while some are 
not: 
* Addition, subtraction, and multiplication are the same, but the behavior of division 
differs. Python also added the ** operator for exponentiation. 

* As of Python 3.5, the @ infix operator supports matrix multiplication directly, versus 
C and Java, which implement it as library functions. Earlier versions of Python 
also used methods instead of an infix operator. 

* In Python, == compares by value, versus Java, which compares numerics by value and 
objects by reference. (Value comparisons in Java on objects can be performed with 
the equals() method.) Python's is operator may be used to compare object identities 
(comparison by reference). In Python, comparisons may be chained, for example a <= 
b <= c. 

* Python uses the words and, or, not for its boolean operators rather than the symbolic 
&&, ||, ! used in Java and C. 

* Python has a type of expression termed a list comprehension. Python 2.4 extended list 
comprehensions into a more general expression termed a generator expression. 

* Anonymous functions are implemented using lambda expressions; however, these are 
limited in that the body can only be one expression. 

* Conditional expressions in Python are written as x if c else y (different in order of 
operands from the c ? x : y operator common to many other languages). 

* Python makes a distinction between lists and tuples. Lists are written as [1, 2, 3], are 
mutable, and cannot be used as the keys of dictionaries (dictionary keys must be 
immutable in Python). Tuples are written as (1, 2, 3), are immutable and thus can be 
used as the keys of dictionaries, provided all elements of the tuple are immutable. The 
parentheses around the tuple are optional in some contexts. Tuples can appear on the 
left side of an equal sign; hence a statement like x, y = y, x can be used to swap two 
variables. 

* Python has a “string format” operator %. This functions analogously to printf format 
strings in C. 

* Python has various kinds of string literals: 

(1) Strings delimited by single or double quote marks. Unlike in Unix shells, Perl 
and Perl-influenced languages, single quote marks and double quote marks function 
identically. Both kinds of string use the backslash (\) as an escape character and there 
is no implicit string interpolation such as “$spam” . 

(2) Triple-quoted strings, which begin and end with a series of three single or 
double quote marks. They may span multiple lines and function like here documents 
in shells, Perl and Ruby. 

(3) Raw string varieties, denoted by prefixing the string literal with an r. No 
escape sequences are interpreted; hence raw strings are useful where literal 
backslashes are common, such as regular expressions and Windows-style paths. 
Compare “@-quoting” in C#. 

* Python has array index and array slicing expressions on lists, denoted as a[key], 
a[start:stop] or a[start:stop:step]. Indexes are zero-based, and negative indexes are 
relative to the end. Slices take elements from the start index up to, but not including, 
the stop index. The third slice parameter, called step or stride, allows elements to be 
skipped and reversed. Slice indexes may be omitted, for example a[:] returns a copy of 
the entire list. Each element of a slice is a shallow copy. 
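
The expression forms in the list above can be tried out in a few lines. This is only an illustrative sketch; every variable name in it is hypothetical.

```python
squares = [n * n for n in range(5)]           # list comprehension
total = sum(n * n for n in range(5))          # generator expression
double = lambda v: v * 2                      # lambda: the body is one expression
parity = "even" if total % 2 == 0 else "odd"  # conditional expression

x, y = 1, 2
x, y = y, x                                   # tuple assignment swaps the names

in_range = 0 <= x <= 10                       # chained comparison

letters = ["a", "b", "c", "d", "e"]
middle = letters[1:4]         # start index included, stop index excluded
every_other = letters[::2]    # a step of 2 skips elements
backwards = letters[::-1]     # a negative step reverses
last = letters[-1]            # a negative index counts from the end
copy_of_list = letters[:]     # omitted indexes copy the whole list

grid = {(0, 0): "origin"}     # a tuple of immutables can be a dictionary key
message = "%s has %d items" % ("letters", len(letters))   # % string formatting
raw = r"C:\new\table"         # raw string: backslashes stay literal

print(squares, total, double(21), parity, x, y, in_range)
print(middle, every_other, backwards, last, message, raw)
```

Trying to use a list such as `[0, 0]` as a dictionary key raises `TypeError`, because lists are mutable.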

In Python, a distinction between expressions and statements is rigidly enforced, in 
contrast to languages such as Common Lisp, Scheme, or Ruby. This leads to duplicating some 
functionality. 

Statements cannot be a part of an expression, so list and other comprehensions or lambda 
expressions, all being expressions, cannot contain statements. A particular case of this is that 
an assignment statement such as a = 1 cannot form part of the conditional expression of a 
conditional statement. This has the advantage of avoiding a classic C error of mistaking an 
assignment operator = for an equality operator == in conditions: if (c = 1) { ... } is valid C 
code but if c = 1: ... causes a syntax error in Python. 
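
The syntax error mentioned above can be observed without crashing a program by handing source text to the built-in `compile` function. The helper name `compiles` is hypothetical.

```python
def compiles(source):
    """Return True if `source` is syntactically valid Python."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


# An assignment statement inside a condition is rejected by the parser ...
assignment_in_condition = compiles("c = 0\nif c = 1:\n    pass\n")
# ... while the equality operator is accepted.
comparison_in_condition = compiles("c = 0\nif c == 1:\n    pass\n")

print(assignment_in_condition, comparison_in_condition)
```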


2.4 Mathematics 


Python has the usual C arithmetic operators (+, -, *, /, %). It also has ** for 
exponentiation, e.g. 5**3 == 125 and 9**0.5 == 3.0, and a new matrix multiply @ operator is 
included in version 3.5. 

Python provides a round function for rounding a float to the nearest integer. For 
tie-breaking, versions before 3 use round-away-from-zero: round(0.5) is 1.0, round(-0.5) is 
-1.0. Python 3 uses round to even: round(1.5) is 2, round(2.5) is 2. 

Python allows boolean expressions with multiple equality relations in a manner that is 
consistent with general use in mathematics. For example, the expression a < b < c tests 
whether a is less than b and b is less than c. C-derived languages interpret this expression 
differently: in C, the expression would first evaluate a < b, resulting in 0 or 1, and that result 
would then be compared with c. 
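
The arithmetic rules above are easy to check in an interpreter. The sketch below assumes Python 3; the variable names are hypothetical, and `c_style` imitates by hand the grouping that C would apply.

```python
cube = 5 ** 3             # exponentiation with integers
root = 9 ** 0.5           # a fractional exponent gives a float

# Python 3 rounds ties to the nearest even integer.
ties = [round(0.5), round(1.5), round(2.5), round(-0.5)]

a, b, c = 3, 2, 1
chained = a < b < c       # same as (a < b) and (b < c)
c_style = (a < b) < c     # C's grouping: (a < b) is False, i.e. 0, and 0 < 1

print(cube, root, ties, chained, c_style)
```

With these values the two readings disagree: the chained comparison is false, while the C-style grouping is true.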

Due to Python's extensive mathematics library, it is frequently used as a scientific 
scripting language to aid in problems such as numerical data processing and manipulation. 


3. Libraries 


Python has a large standard library, commonly cited as one of Python's greatest 
strengths, providing tools suited to many tasks. For Internet-facing applications, many 
standard formats and protocols (such as MIME and HTTP) are supported. Modules for 
creating graphical user interfaces, connecting to relational databases, generating 
pseudorandom numbers, doing arithmetic with arbitrary-precision decimals, manipulating 
regular expressions, and doing unit testing are also included. 
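
As a small illustration of the standard library, the sketch below uses three of the modules mentioned above (`decimal`, `re` and `random`). The sample string and variable names are hypothetical.

```python
import random
import re
from decimal import Decimal

# Arbitrary-precision decimal arithmetic avoids binary floating-point error.
exact = Decimal("0.1") + Decimal("0.2")
inexact = 0.1 + 0.2       # binary floats cannot represent 0.1 exactly

# Regular expressions: pull the runs of digits out of a string.
numbers = re.findall(r"\d+", "HTTP 200, MIME 1.0")

# Pseudorandom numbers: the same seed reproduces the same sequence.
random.seed(42)
first = random.random()
random.seed(42)
second = random.random()

print(exact, inexact, numbers, first == second)
```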

The standard library is not needed to run Python or embed it in an application. For 
example, Blender 2.49 omits most of the standard library. 

As of January 2016, the Python Package Index, the official repository of third-party 
software for Python, contains more than 72,000 packages offering a wide range of 
functionality, including: 

* graphical user interfaces, web frameworks, multimedia, databases, networking and 
communications 

* test frameworks, automation and web scraping, documentation tools, system 
administration 

* scientific computing, text processing, image processing 


4. Development environments 


Most Python implementations can function as a command line interpreter, for which the 
user enters statements sequentially and receives the results immediately (read-eval-print loop 
(REPL)). In short, Python acts as a command-line interface or shell. 

Other shells add abilities beyond those in the basic interpreter, including IDLE and 
IPython. While generally following the visual style of the Python shell, they implement 
features like auto-completion, session state retention, and syntax highlighting. 

In addition to standard desktop integrated development environments (Python IDEs), 
there are also web browser-based IDEs; Sage (intended for developing science and math-related 
Python programs); and PythonAnywhere, a browser-based IDE and hosting environment. 


New Words 

high-level ['hai'levəl] adj. 高级的 
general-purpose ['dʒenərəl'pə:pəs] adj. 通用的 
philosophy [fi'lɔsəfi] n. 哲学，哲学体系 
emphasize ['emfəsaiz] vt. 强调，着重 
readability [ˌri:də'biliti] n. 易读，可读性 
programmer ['prəugræmə] n. 程序员 
imperative [im'perətiv] n. 命令 
    adj. 命令的 
unclutter [ʌn'klʌtə] vt. 使整洁，整理 
exception [ik'sepʃən] n. 异常，例外 
indentation [ˌinden'teiʃən] n. 缩排 
keyword ['ki:wə:d] n. 关键字 
statement ['steitmənt] n. 语句 
declared [di'klɛəd] adj. 声明的 
improper [im'prɔpə] adj. 不适当的，不合适的，不正确的 
string [striŋ] n. 字符串 
successive [sək'sesiv] adj. 连续的 
contraction [kən'trækʃən] n. 缩写式，紧缩 
iterate ['itəreit] vt. 重复 
clause [klɔ:z] n. 子句 
namespace ['neimspeis] n. 名空间 
class [klɑ:s] n. 类 
enclose [in'kləuz] vt. 围住，封入 
manager ['mænidʒə] n. 管理器 
debugging [di:'bʌgiŋ] n. 调试 
generator ['dʒenəreitə] n. 生成器 
coroutine [ˌkəuru:'ti:n] n. 协同程序 
expression [iks'preʃən] n. 表达式 
exponentiation [ˌekspəuˌnenʃi'eiʃən] n. 求幂 
operator ['ɔpəreitə] n. 运算符 
infix [in'fiks] n. 中缀 
    vt. 插入 
mutable ['mju:təbl] adj. 可变的，易变的 
immutable [i'mju:təbl] adj. 不可变的，不能变的 
tuple ['tʌpl] n. 元组 
parentheses [pə'renθəsi:z] n. 圆括号 
optional ['ɔpʃənəl] adj. 可选择的 
context ['kɔntekst] n. 上下文，情景 
analogous [ə'næləgəs] adj. 类似的，相似的 
literal ['litərəl] adj. 文字的，照字面上的 
backslash ['bækslæʃ] n. 反斜杠 
prefix ['pri:fiks] n. 前缀 
parameter [pə'ræmitə] n. 参数，参量 
reverse [ri'və:s] vt. 颠倒，反转 
omit [əu'mit] vt. 省略，遗漏 
rounding ['raundiŋ] n. 舍入，取整 
module ['mɔdju:l] n. 模块 
pseudorandom [ˌpsju:dəu'rændəm] adj. 伪随机的 
multimedia [ˌmʌlti'mi:diə] n. 多媒体 
communication [kəˌmju:ni'keiʃən] n. 通信 

Phrases 

dynamic programming language  动态编程语言 
design philosophy  设计原理 
lines of code  代码行 
memory management  内存管理 
standard library  标准库 
third-party tool  第三方工具 
structured programming  结构化编程 
aspect-oriented programming  面向切面编程 
design by contract  契约设计 
logic programming  逻辑编程 
special case  特殊情况 
curly braces  大括号，花括号 
off-side rule  越位规则 
control flow  控制流 
assignment statement  赋值语句 
symbolic address  符号地址 
storage location  存储位置，存储单元 
matrix multiplication  矩阵乘法 
library function  库函数 
object identity  对象标识 
boolean operator  布尔运算符，逻辑运算符 
list comprehension  列表解析，列表推导 
generator expression  生成器表达式 
anonymous function  匿名函数 
lambda expression  λ 表达式 
conditional expression  条件表达式 
make a distinction between...  对……加以区别 
string literal  字符串字面量 
single quote mark  单引号 
double quote mark  双引号 
escape character  转义字符 
string interpolation  字符串插值 
triple-quoted string  三重引号字符串 
regular expression  正则表达式 
array index  数组下标 
array slicing  数组切片 
shallow copy  浅拷贝 
particular case  特别情况，特例 
have the advantage of  胜过 
boolean expression  布尔表达式 
in a manner  在某种意义上 
pseudorandom number generator  伪随机数产生器 
web scraping  网页抓取 
image processing  图像加工，图像处理 
development environment  开发环境 
command line interpreter  命令行解释程序 
session state retention  会话状态保留 
hosting environment  托管环境 


Abbreviations 


RAII (Resource Acquisition Is Initialization)  资源获得即初始化 
NOP (No Operation)  无操作 
MIME (Multipurpose Internet Mail Extensions)  多用途互联网邮件扩展 
HTTP (HyperText Transfer Protocol)  超文本传输协议 
REPL (Read-Eval-Print Loop)  读取-求值-打印循环 


Exercises 


【Ex. 5】根据课文内容回答问题。 
1. What is Python? 
2. What are the key features of Python? 
3. What does the if statement conditionally do? 
4. What does the try statement do? 
5. What does Python 3.5 support matrix multiplication with? 
6. What distinction does Python make between lists and tuples? 
7. What do triple-quoted strings begin and end with? 
8. What is the third slice parameter called? What does it do? 
9. What is one of Python's greatest strengths? 
10. What can most Python implementations function as? 


参考译文 


R 编程语言 


R 语言是用于统计计算和图形的编程语言和软件环境，由统计计算 R 基金会支持。
统计学家和数据挖掘者广泛使用 R 语言来开发统计软件和分析数据。民意测验、对数据
挖掘者的调查以及对学术文献数据库的研究都显示，R 语言近年来的受欢迎程度大大提高。 

R 语言是一个 GNU 项目。R 软件环境的源代码主要用 C 语言、Fortran 语言和
R 语言编写。R 语言可以按 GNU 通用公共许可证免费获得，也提供用于各种操作系
统的预编译二进制版本。虽然 R 语言有一个命令行界面，但也有几个图形前端可供使用。 


1. 历史 


R 语言是 S 语言与词法作用域语义的结合。S 语言由 John Chambers 在贝尔实验室
创建。虽然两者有一些重要的区别，但为 S 编写的大部分代码无须修改即可运行。 

R 语言由新西兰奥克兰大学的罗斯·伊哈卡（Ross Ihaka）和罗伯特·杰特曼（Robert
Gentleman）创建，目前由钱伯斯（Chambers）所在的 R 语言开发核心团队开发。该项
目于 1992 年构思，最初版本于 1994 年发布，2000 年发布了稳定的 beta 版本。 


2. 统计功能 


R 语言及其库采用了各种统计和图形技术，包括线性和非线性建模、经典统计测试、
时间序列分析、分类及聚类等。R 语言可以通过函数和扩展部件轻松扩充，R 社区在软
件包方面的积极贡献值得注意。R 语言的许多标准函数都是用 R 语言编写的，这使得用
户可以轻松地实现所选择的算法。对于计算量大的任务，可以在运行时链接和调用 C、C++
和 Fortran 代码。高级用户可以编写 C、C++、Java、.NET 或 Python 代码来直接处理 R
对象。R 语言使用用户提交的软件包极大地扩展了其应用，这些软件包可用于特定函数
或特定的研究领域。由于其继承了 S 语言，因而当采用面向对象编程技术时，R 语言比
大多数统计计算语言更便利。可以通过词法作用域规则轻松扩展 R 语言。 

R 语言的另一个优点是静态图形处理能力强，可以生成出版品质的图形，包括数学
符号。使用其他软件包也可以处理动态和交互式图形。 

R 语言有其自身的 LaTeX 类文档格式 Rd，广泛支持各类文档，既支持多种在线格
式，也支持硬拷贝。 


3. 编程功能 


R 语言是一种解释语言，用户通常通过命令行解释器访问它。如果用户在 R 命令提
示符下键入 “2 + 2” 并按回车键，则计算机将回复 4。 

像其他类似的语言（如 APL 和 MATLAB）一样，R 语言支持矩阵运算。R 语言的
数据结构包括向量、矩阵、数组、数据帧（类似于关系数据库中的表）和列表。R 语言
的可扩展对象系统包括用于回归模型、时间序列和地理空间坐标（以及其他）的对象。
标量数据类型从不是 R 语言的数据结构。相反，标量被表示为长度为 1 的向量。 

R 语言支持带函数的过程化编程，对于一些函数，也可用于带类函数的面向对象编
程。类函数根据给其传递的参数类别而有所不同。换句话说，类函数为特定的类对象指
定特定的函数（或方法）。例如，R 语言具有通用 print 功能，可以使用简单的
print(objectname) 语法在 R 语言中打印几乎每一类对象。 

虽然 R 语言主要由统计学家和其他从业者用作统计计算和软件开发的环境，但也可
以用作一般矩阵计算工具箱，其性能基准与 GNU Octave 或 MATLAB 相当。 


4. 软件包 


可通过用户创建的软件包来扩展 R 语言的性能，这些软件包专用于统计技术、图形
设备（如 Hadley Wickham 开发的 ggplot2 包）、导入/导出功能、报告工具（knitr、Sweave）
等方面。这些软件包主要用 R 语言开发，有时也用 Java、C、C++ 和 Fortran 来开发。 

R 语言的安装包括一套核心套件，还有超过 7801 个附加软件包（截至 2016 年 1 月），
这些附加软件包可从 CRAN（R 综合归档网）、Bioconductor（生物导体）、Omegahat、
GitHub 和其他软件库获得。 

CRAN 网站上的“任务视图”页面（主题列表）列出了可用 R 语言完成的各种任务
（诸如金融、遗传学、高性能计算、机器学习、医学影像、社会科学和空间统计学等领
域）和可用的软件包。R 语言已经被 FDA 认定为适合解释临床研究数据。 

其他 R 语言包资源包括 Crantastic（用于评估和审查所有 CRAN 软件包的社区网站）
以及 R-Forge（这是 R 语言软件包以及 R 语言相关软件和项目协同开发的中心平台）。
R-Forge 还提供许多未发布的 beta 测试版软件包和 CRAN 软件包的开发版本。 


5. 界面 


5.1 图形用户界面 


* Architect：基于 Eclipse 和 StatET 的数据科学的跨平台开源 IDE。 
* DataJoy：专注于数据科学与协作的、面向初学者的在线 R 编辑器。 
* Deducer：用于菜单驱动数据分析的 GUI（类似于 SPSS/JMP/Minitab）。 
* Java GUI for R：跨平台的独立 R 终端和基于 Java 的编辑器（也称为 JGR）。 
* Number Analytics：基于云的业务分析 GUI（类似于 SPSS）。 
* Rattle GUI：基于 RGtk2 的跨平台 GUI，专为数据挖掘而设计。 
* R Commander：基于 tcltk 的跨平台菜单驱动的 GUI（也可以使用几个 Rcmdr 插件）。 
* Revolution R Productivity Environment (RPE)：Revolution Analytics 提供的基于
Visual Studio 的 IDE，并具有基于网页的页面操作界面。 
* RGUI：带有用于 Microsoft Windows 的预编译版本。 
* RKWard：用于 R 的可扩展 GUI 和 IDE。 
* RStudio：跨平台开源 IDE（也可以在远程 Linux 服务器上运行）。 


5.2 编辑器和 IDE 


支持 R 语言的文本编辑器和集成开发环境（IDE）包括 ConTEXT、Eclipse（StatET）、
Emacs（Emacs Speaks Statistics）、LyX（用于 knitr 和 Sweave 的模块）、Vim、jEdit、
Kate、RStudio、Sublime Text、TextMate、Atom、WinEdt（R Package RWinEdt）、Tinn-R、
Notepad++ 和 Architect。 


5.3 脚本语言 


可以从几种脚本语言（如 Python、Perl、Ruby、F# 和 Julia）访问 R 功能。R 本身的
脚本可以通过一个名叫“小可爱”（littler）的前端访问。 


6. 与 SAS、SPSS 和 Stata 的比较 


一个共识是：与其他受欢迎的统计软件包（如 SAS、SPSS 和 Stata）相比，R 语言
更优异。对统计软件的所有基本特色功能进行比较后，人们认为 R 是最好的统计软件。
2015 年 1 月，《纽约时报》刊登的一篇文章指出：R 语言在数据分析师中获得广泛
认可，并对 SAS 等商业统计软件所占据的市场份额构成了威胁。 


Unit 5 


Text A 


Data Structure 


A data structure is a specialized format for organizing and storing data. General data 
structure types include the array, the file, the record, the table, the tree, and so on. Any data 
structure is designed to organize data to suit a specific purpose so that it can be accessed and 
worked with in appropriate ways. In computer programming, a data structure may be selected 
or designed to store data for the purpose of working on it with various algorithms. 


1. Array 


(1) In data storage, an array is a method for storing information on multiple devices. 

(2) In general, an array is a number of items arranged in some specified way, for example, 
in a list or in a three-dimensional table. 

(3) In computer programming languages, an array is a group of objects with the same 
attributes that can be addressed individually, using such techniques as subscripting. 

(4) In random access memory (RAM), an array is the arrangement of memory cells. 
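
In the programming-language sense (3), the elements of an array are reached through subscripts. The following sketch uses a Python list as the array; the sample values and names are hypothetical.

```python
temperatures = [21.5, 23.0, 19.8, 22.1]   # hypothetical sample readings

first = temperatures[0]        # subscripts start at 0
temperatures[2] = 20.0         # an element can be replaced through its subscript

# A table with rows and columns can be modelled as a list of lists.
table = [[1, 2, 3],
         [4, 5, 6]]
cell = table[1][2]             # row 1, column 2

print(first, temperatures, cell)
```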


2. File 


(1) In data processing, a file is a related collection of records. For example, you might 
put the records you have on each of your customers in a file. In turn, each record would 
consist of fields for individual data items, such as customer name, customer number, customer 
address, and so forth. By providing the same information in the same fields in each record so 
that all records are consistent, your file will be easily accessible for analysis and manipulation 
by a computer program. This use of the term has become somewhat less important with the 
advent of the database and its emphasis on the table as a way of collecting record and field 
data. In mainframe systems, the term data set is generally synonymous with file but implies a 
specific form of organization recognized by a particular access method. Depending on the 
operating system, files (and data sets) are contained within a catalog, directory, or folder. 

(2) In any computer system, especially in personal computers, a file is an entity of data 
available to system users (including the system itself and its application programs) that is 
capable of being manipulated as an entity (for example, moved from one file directory to 
another). The file must have a unique name within its own directory. Some operating systems 
and applications describe files with given formats by giving them a particular file name suffix. 
The file name suffix is also known as a file name extension. For example, a program or 
executable file is sometimes given or required to have an “.exe” suffix. In general, the suffixes 
tend to be as descriptive of the formats as they can within the limits of the number of 
characters allowed for suffixes by the operating system. 
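
The file name suffix described in (2) can be inspected programmatically. This sketch uses Python's `pathlib` module; the path itself is a made-up example and no file is actually opened.

```python
from pathlib import PurePosixPath

# PurePosixPath only parses the text of a path; it never touches the disk.
path = PurePosixPath("reports/2016/summary.tar.gz")

name = path.name            # file name within its directory
suffix = path.suffix        # the last suffix (file name extension)
suffixes = path.suffixes    # every suffix, in order
parent = str(path.parent)   # the directory that contains the file

print(name, suffix, suffixes, parent)
```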


3. Record 


(1) In computer data processing, a record is a collection of data items arranged for 
processing by a program. Multiple records are contained in a file or data set. The organization 
of data in the record is usually prescribed by the programming language that defines the 
record’s organization and/or by the application that processes it. Typically, records can be of 
fixed length or of variable length, with the length information contained within the record. 

(2) In a database, a record, sometimes called a row, is a group of fields within a table that 
are relevant to a specific entity. For example, in a table called customer contact information, a 
row would likely contain fields such as: ID number, name, street address, city, telephone 
number and so on. 
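
The relationship between fields, records and a file can be sketched as follows. The `Customer` class, its field names and the sample data are all hypothetical; an in-memory text buffer stands in for a real file.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Customer:
    """One record: a fixed set of named fields."""
    customer_id: int
    name: str
    city: str


# A small in-memory "file": every line is a record with the same fields.
raw_file = io.StringIO(
    "customer_id,name,city\n"
    "1,Ann Lee,Leeds\n"
    "2,Bo Chen,York\n"
)

# Read each line into a record object.
customers = [
    Customer(int(row["customer_id"]), row["name"], row["city"])
    for row in csv.DictReader(raw_file)
]
names = [c.name for c in customers]

print(customers[0], names)
```

Because every record carries the same fields in the same order, the program can process the whole file with one loop.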


4. Table 


In computer programming, a table is a data structure used to organize information, just as 
it is on paper. There are many different types of computer-related tables, which work in a 
number of different ways. The following are examples of the more common types. 

(1) In data processing, a table, also called an array, is an organized grouping of fields. 
Tables may store relatively permanent data, or may be frequently updated. For example, a 
table contained in a disk volume is updated when sectors are being written. 

(2) In a relational database, a table, sometimes called a file, organizes the information 
about a single topic into rows and columns. For example, a database for a business would 
typically contain a table for customer information, which would store customers’ account 
numbers, addresses, phone numbers, and so on as a series of columns. Each single piece of 
data, such as the account number, is a field in the table. A column consists of all the entries in 
a single field, such as the telephone numbers of all the customers. Fields, in turn, are 
organized as records, which are complete sets of information, such as the set of information 
about a particular customer, each of which comprises a row. The process of normalization 
determines how data will be most effectively organized into tables. 

(3) A decision table, often called a truth table, which can be computer-based or simply 
drawn up on paper, contains a list of decisions and the criteria on which they are based. All 
possible situations for decisions should be listed, and the action to take in each situation 
should be specified. A rudimentary example: For a traffic intersection, the decision to proceed 
might be expressed as yes or no and the criteria might be the light is red or the light is green. 

A decision table can be inserted into a computer program to direct its processing 
according to decisions made in different situations. Changes to the decision table are reflected 
in the program. 

(4) An HTML table is used to organize Web page elements spatially or to create a 
structure for data that is best displayed in tabular form, such as lists or specifications. 
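
Two of the table types above, the relational table in (2) and the decision table in (3), can be sketched in a few lines. The table name `customer`, its columns and the sample rows are hypothetical, and an in-memory SQLite database stands in for a real business database.

```python
import sqlite3

# Relational table: rows are records, columns are fields.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (account_number INTEGER PRIMARY KEY, "
    "address TEXT, phone TEXT)"
)
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1001, "1 High St", "555-0101"), (1002, "2 Low Rd", "555-0102")],
)

# A row holds every field for one customer.
row = conn.execute(
    "SELECT * FROM customer WHERE account_number = ?", (1001,)
).fetchone()

# A column holds one field for every customer.
phones = [phone for (phone,) in conn.execute(
    "SELECT phone FROM customer ORDER BY account_number")]
conn.close()

# Decision table for the traffic-intersection example:
# each criterion (the light's colour) maps to the decision to proceed.
decision_table = {"red": False, "green": True}


def may_proceed(light):
    """Look the decision up instead of hard-coding if/else logic."""
    return decision_table[light]


print(row, phones, may_proceed("red"), may_proceed("green"))
```

Keeping the decisions in a table means that a change to the table changes the program's behavior without touching its logic, as the text notes.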


New Words 

specialized ['speʃəlaizd] adj. 专用的，专门的 
organize ['ɔ:gənaiz] vt. 组织；构成，组成 
array [ə'rei] n. 数组，排列 
record ['rekɔ:d] n. 记录 
    [ri'kɔ:d] vt. 记录；录音 
table ['teibl] n. 表 
appropriate [ə'prəupriət] adj. 正确的，适当的 
various ['vɛəriəs] adj. 不同的，各种各样的，多方面的，多样的 
subscript ['sʌbskript] adj. 下标的 
collection [kə'lekʃən] n. 集合，收集来的总和 
item ['aitəm] n. 项，项目 
consistent [kən'sistənt] adj. 一致的，调和的，相容的 
accessible [ək'sesəbl] adj. 易接近的，可访问的，易受影响的 
manipulation [məˌnipju'leiʃən] n. 处理，操作 
advent ['ædvənt] n. 出现，到来 
emphasis ['emfəsis] n. 强调，重点 
imply [im'plai] vt. 暗示，意味 
suffix ['sʌfiks] n. 后缀；下标 
prescribe [pris'kraib] vt. 规定，指定 
define [di'fain] vt. 定义，详细说明 
length [leŋθ] n. 长度 
row [rəu] n. 行，排 
relevant ['relivənt] adj. 有关的，相应的 
common ['kɔmən] adj. 共同的，公共的，公有的，普通的 
permanent ['pə:mənənt] adj. 永久的，持久的 
frequently ['fri:kwəntli] adv. 常常，频繁地 
sector ['sektə] n. 扇区 
normalization [ˌnɔ:məlai'zeiʃən] n. 规范化，正常化，标准化 
criteria [krai'tiəriə] n. 标准 
rudimentary [ˌru:di'mentəri] adj. 基本的，初步的 
intersection [ˌintə'sekʃən] n. 交集，十字路口，交叉点 
spatial ['speiʃəl] adj. 空间的，立体的，三维的 

Phrases 

data structure  数据结构 
and so on  等等 
a number of  许多的 
three-dimensional table  三维表 
memory cell  内存单元 
and so forth  等等 
data set  数据集 
file name extension  文件扩展名 
account number  账号 
in turn  依次，轮流 
decision table  判定表，决策表 
truth table  真值表 
draw up  草拟 


Abbreviations 


ID (Identification, Identity)  身份 


Notes 


[1] Any data structure is designed to organize data to suit a specific purpose so that it can be 
accessed and worked with in appropriate ways. 
本句中，to organize data to suit a specific purpose so that it can be accessed and worked 
with in appropriate ways 是一个动词不定式短语，作目的状语，修饰 is designed。在
该短语中，to suit a specific purpose 也是一个动词不定式短语，作目的状语，修饰 to 
organize；so that it can be accessed and worked with in appropriate ways 是一个目的状
语从句。 

[2] In computer programming languages, an array is a group of objects with the same 
attributes that can be addressed individually, using such techniques as subscripting. 
本句中，with the same attributes 是一个介词短语，作定语，修饰和限定 a group of 
objects。that can be addressed individually 是一个定语从句，也修饰和限定 a group of 
objects。using such techniques as subscripting 是一个现在分词短语，作方式状语，修
饰从句的谓语 can be addressed。 

[3] By providing the same information in the same fields in each record so that all records are 
consistent, your file will be easily accessible for analysis and manipulation by a computer 
program. 
本句中，By providing the same information in the same fields in each record so that all 
records are consistent 是一个介词短语，作方式状语，修饰谓语 will be easily 
accessible。在该短语中，so that all records are consistent 是一个结果状语从句，修饰 
providing。 

英语中，so that 既可以引导一个目的状语从句，也可以引导一个结果状语从句。请
看下例： 

We asked the professor to speak louder so that we could hear him. 

我们请教授讲话声再大一些，以便让我们能听清。（目的状语从句） 

Mary didn’t plan her time well, so that she didn’t finish the work in time. 

玛丽没有把时间计划好，结果没有按时完成这项工作。（结果状语从句） 

[4] In any computer system, especially in personal computers, a file is an entity of data 
available to system users (including the system itself and its application programs) that is 
capable of being manipulated as an entity (for example, moved from one file directory to 
another). 
本句中，available to system users 是一个形容词短语，作定语，修饰和限定 an entity 
of data。that is capable of being manipulated as an entity 是一个定语从句，也修饰和限
定 an entity of data。 

[5] The organization of data in the record is usually prescribed by the programming language 
that defines the record’s organization and/or by the application that processes it. 
本句中，and/or 连接了 by 引导的两个方式状语。that defines the record’s organization 
是一个定语从句，修饰和限定 the programming language。 

[6] For example, a database for a business would typically contain a table for customer 
information, which would store customers’ account numbers, addresses, phone numbers, 
and so on as a series of columns. 
本句中，which would store customers’ account numbers, addresses, phone numbers, and 
so on as a series of columns 是一个非限定性定语从句，对 a table for customer 
information 进行补充说明。 

[7] Fields, in turn, are organized as records, which are complete sets of information, such as 
the set of information about a particular customer, each of which comprises a row. 
本句中，which are complete sets of information 是一个非限定性定语从句，对 records 
进行补充说明。such as the set of information about a particular customer 是对 complete 
sets of information 的举例说明。each of which comprises a row 是一个非限定性定语从
句，对 the set of information about a particular customer 进行补充说明。 
英语中，定语从句还可以由名词（代词/数词）+ of + which (whom) 来引导，表示
部分与整体的关系。注意不要误用 which 和 whom：which 指物，whom 指人。
请看下例： 

Peter’s father knows a lot of people, many of whom are professors. 
彼得的爸爸认识许多人，其中许多是教授。 

She bought many books yesterday, five of which are on ERP. 

她昨天买了许多书，其中 5 本是 ERP 方面的。 


Exercises 


【Ex. 1】根据课文内容回答问题。 

1. What is a data structure? 

2. What is an array in computer programming languages? 

3. Must the file have a unique name within its own directory? 

4. How do some operating systems and applications describe files with given formats? 
5. How is the organization of data in the record usually prescribed? 
6. What is a table in computer programming? 

7. What is a table in data processing? 

8. What is a table in a relational database? 

9. What is a decision table often called? What does it contain? 

10. What is an HTML table used to do? 


[Ex. 2] Write the English word that matches each of the following definitions.
1. ______: A signal to a computer that stops the execution of a running program so that
another action can be performed.
2. ______: A collection of related, often adjacent items of data, treated as a unit.
3. ______: In word processing, a block of text formatted in aligned rows and columns.
4. ______: A multi-element data structure that has a linear organization but that allows
elements to be added or removed in any order.
5. ______: A distinguishing character or symbol written directly beneath or next to and
slightly below a letter or number.
6. ______: An affix added to the end of a word or stem.
7. ______: To make or write a definition.
8. ______: A series of objects placed next to each other, usually in a straight line.
9. ______: A bit or a set of bits on a magnetic storage device making up the smallest
addressable unit of information.
10. ______: To organize data, typically a set of records, in a particular order.


[Ex. 3] Translate the following sentences into Chinese.

1. Star topologies are normally implemented using twisted pair cable, specifically unshielded 
twisted pair (UTP). 

2. A video card is the part of your computer that transforms video data into the visual display 
you see on your monitor. 

3. A multi-user operating system allows many different users to take advantage of the 
computer’s resources simultaneously. 

4. Address is the unique location of an information site on the Internet, a specific file (for 
example, a Web page), or an E-mail user. 

5. Over the years, ARPA has funded many projects in computer science research, many of 
which had a profound effect on the state of the art. 

6. In truth, of course, by making the creation of more complex software practical, computer
languages have merely created new types of software bugs.

7. A computer virus is a program designed to spread itself by first infecting executable files or 
the system areas of hard and floppy disks and then making copies of itself. 

8. When the entire RAM is being used (for example if there are many programs open at the 
same time) the computer will swap data to the hard drive and back to give the impression 
that there is slightly more memory. 

9. The compiler ignores all comments. 

10. You can E-mail your document without ever leaving Word.

[Ex. 4] Fill in each blank with the appropriate word from the list (use each word only once).


records    leaves    beyond    random    database
two    power    depth    nodes    end


A binary tree is a method of placing and locating files (called records or keys) in
a __(1)__, especially when all the data is known to be in __(2)__ access memory (RAM).
The algorithm finds data by repeatedly dividing the number of ultimately accessible __(3)__ in
half until only one remains.

In a tree, records are stored in locations called __(4)__. This name derives from the fact
that records always exist at __(5)__ points; there is nothing __(6)__ them. Branch points
are called __(7)__. The order of a tree is the number of branches (called children) per node.
In a binary tree, there are always __(8)__ children per node, so the order is 2. The number of
leaves in a binary tree is always a __(9)__ of 2. The number of access operations required to
reach the desired record is called the __(10)__ of the tree.


Text B 


Structured Data, Semi-structured 
Data and Unstructured Data 


1. Structured Data 


Structured data refers to any data that resides in a fixed field within a record or file. This 
includes data contained in relational databases and spreadsheets. 


1.1 Characteristics of Structured Data 


Structured data first depends on creating a data model — a model of the types of business 
data that will be recorded and how they will be stored, processed and accessed. This includes 
defining what fields of data will be stored and how that data will be stored: data type (numeric, 
currency, alphabetic, name, date, address) and any restrictions on the data input (number of 
characters; restricted to certain terms such as Mr., Ms. or Dr.; M or F).
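The idea of a data model with typed fields and input restrictions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name CustomerRecord, the field names, and the 40-character limit are hypothetical and do not come from the text.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical restrictions on the data input, echoing the examples in the text.
ALLOWED_TITLES = {"Mr.", "Ms.", "Dr."}
ALLOWED_GENDERS = {"M", "F"}
MAX_NAME_LENGTH = 40  # assumed limit on the number of characters


@dataclass
class CustomerRecord:
    """One record in an illustrative structured data model."""
    title: str        # restricted to certain terms
    name: str         # alphabetic, limited length
    gender: str       # "M" or "F"
    birth_date: date  # date type
    balance: float    # currency (float is used only to keep the sketch short)

    def __post_init__(self):
        # Reject any value that breaks the model's restrictions.
        if self.title not in ALLOWED_TITLES:
            raise ValueError("title must be Mr., Ms. or Dr.")
        if self.gender not in ALLOWED_GENDERS:
            raise ValueError("gender must be M or F")
        if not 0 < len(self.name) <= MAX_NAME_LENGTH:
            raise ValueError("name length out of range")
        if not isinstance(self.birth_date, date):
            raise ValueError("birth_date must be a date")


accepted = CustomerRecord("Dr.", "Ada Lovelace", "F", date(1815, 12, 10), 120.50)

try:
    CustomerRecord("Prof.", "Alan Turing", "M", date(1912, 6, 23), 0.0)
    rejected = False
except ValueError:
    rejected = True  # "Prof." is not one of the permitted terms
```

Because every record passes through the same checks, any data that reaches storage already fits the model, which is what makes structured data easy to query and analyze.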

Structured data has the advantage of being easily entered, stored, queried and analyzed.
At one time, because of the high cost and performance limitations of storage, memory and
processing, relational databases and spreadsheets using structured data were the only ways to
effectively manage data. Anything that couldn't fit into a tightly organized structure would
have to be stored on paper in a filing cabinet.


1.2 Managing Structured Data 


Structured data is often managed using Structured Query Language (SQL), a
programming language created for managing and querying data in relational database
management systems. SQL was originally developed by IBM in the early 1970s and later
developed commercially by Relational Software, Inc. (now Oracle Corporation).

Structured data was a huge improvement over strictly paper-based unstructured systems,
but life doesn’t always fit into neat little boxes. As a result, structured data always had to
be supplemented by paper or microfilm storage. As technology performance continued to
improve and prices dropped, it became possible to bring unstructured and semi-structured
data into computing systems.


1.3 Structured Data Technology Standards 


SQL has been a standard of the American National Standards Institute (ANSI) since 1986. It is
managed by the InterNational Committee for Information Technology Standards (INCITS)
Technical Committee DM32, Data Management and Interchange. The committee has two
task groups, one for databases and the other for metadata. HP, CA, IBM, Microsoft, Oracle,
Sybase (SAP) and Teradata all participate, as well as several federal government agencies.
Both of the committee project documents have links to further information on each project.
SQL became an International Organization for Standardization (ISO) standard in 1987. The
published standards are available for purchase from the ANSI eStandards Store, under the
INCITS/ISO/IEC 9075 classification.


2. Semi-structured Data 


Semi-structured data is a form of structured data that does not conform to the formal
structure of data models associated with relational databases or other forms of data tables
but nonetheless contains tags or other markers to separate semantic elements and enforce
hierarchies of records and fields within the data. Therefore, it is also known as a
self-describing structure.

In semi-structured data, the entities belonging to the same class may have different 
attributes even though they are grouped together, and the attributes’ order is not important. 

Semi-structured data has become increasingly common since the advent of the Internet, where
full-text documents and databases are no longer the only forms of data and different
applications need a medium for exchanging information. In object-oriented databases, one
often finds semi-structured data.


2.1 Types of Semi-structured Data


2.1.1 XML 
XML, other markup languages, email, and EDI are all forms of semi-structured data. 

OEM (Object Exchange Model) was created prior to XML as a means of self-describing a 
data structure. XML has been popularized by web services that are developed utilizing SOAP 
principles. 

Some types of data described here as “semi-structured”, especially XML, suffer from the
impression that they are incapable of structural rigor at the same functional level as
relational tables and rows. Indeed, the view of XML as inherently semi-structured (previously,
it was referred to as “unstructured”) has handicapped its use for a widening range of
data-centric applications. Even documents, normally thought of as the epitome of
semi-structure, can be designed with virtually the same rigor as a database schema, enforced
by the XML schema and processed by both commercial and custom software programs without
reducing their usability by human readers.

In view of this fact, XML might be referred to as having “flexible structure” capable of 
human-centric flow and hierarchy as well as highly rigorous element structure and data 
typing. 
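The self-describing nature of semi-structured data can be illustrated with a small XML fragment. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the element names (customers, customer, name, phone, email) are invented for the example. Note that the two customer entities carry different attributes in a different order, yet the tags still say what each value means.

```python
import xml.etree.ElementTree as ET

# Hypothetical semi-structured data: same class of entity, different attributes.
XML_TEXT = """
<customers>
  <customer id="1">
    <name>Ann</name>
    <phone>555-0100</phone>
  </customer>
  <customer id="2">
    <email>bob@example.com</email>
    <name>Bob</name>
  </customer>
</customers>
"""

root = ET.fromstring(XML_TEXT.strip())

records = []
for customer in root.findall("customer"):
    record = {"id": customer.get("id")}
    # Each child tag names its own field, so no fixed schema is needed.
    for child in customer:
        record[child.tag] = child.text
    records.append(record)
```

Here records holds one dictionary per customer, and each dictionary has only the fields that the corresponding element actually contained.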

2.1.2 JSON 

JSON, or JavaScript Object Notation, is an open standard format that uses human-readable
text to transmit data objects consisting of attribute-value pairs. It is used primarily to
transmit data between a server and a web application, as an alternative to XML. JSON has been
popularized by web services developed utilizing REST principles.

There is a new breed of databases such as MongoDB and Couchbase that store data 
natively in JSON format, leveraging the pros of semi-structured data architecture. 
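As a rough illustration of attribute-value pairs travelling as human-readable text, the sketch below uses Python's standard json module. The order object and its field names are hypothetical.

```python
import json

# A hypothetical data object with nested attribute-value pairs and a list.
order = {
    "order_id": 1001,
    "customer": {"name": "Ann", "city": "Leeds"},
    "items": [
        {"sku": "A-1", "qty": 2},
        {"sku": "B-7", "qty": 1},
    ],
}

# Serialize to text, as a server might before sending it to a web application.
text = json.dumps(order, indent=2)

# Parse the text back into objects on the receiving side.
restored = json.loads(text)
total_qty = sum(item["qty"] for item in restored["items"])
```

The nested customer object and the list of items survive the round trip without being flattened into separate tables.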


2.2 Pros and Cons of Using a Semi-structured Data Format 


2.2.1 Advantages 

• Programmers persisting objects from their application to a database do not need to
worry about object-relational impedance mismatch, but can often serialize objects via
a light-weight library.

• Support for nested or hierarchical data often simplifies data models representing
complex relationships between entities.

• Support for lists of objects simplifies data models by avoiding messy translations of
lists into a relational data model.

2.2.2 Disadvantages

• The traditional relational data model has a popular and ready-made query language,
SQL.

• Semi-structured data is prone to “garbage in, garbage out”: by removing restraints from
the data model, less forethought is necessary to operate a data application.


3. Unstructured Data 


Unstructured data (or unstructured information) refers to information that either does not 
have a pre-defined data model or is not organized in a pre-defined manner. Unstructured 
information is typically text-heavy, but may contain data such as dates, numbers, and facts as 
well. This results in irregularities and ambiguities that make it difficult to understand using 
traditional programs as compared to data stored in fielded form in databases or annotated 
(semantically tagged) in documents. 

In 1998, Merrill Lynch cited a rule of thumb that somewhere around 80%-90% of all 
potentially usable business information may originate in unstructured form. This rule of 
thumb is not based on primary or any quantitative research, but nonetheless is accepted by 
some. 

IDC and EMC project that data will grow to 40 zettabytes by 2020, resulting in a 50-fold
growth from the beginning of 2010. Computerworld magazine states that unstructured
information might account for more than 70%-80% of all data in organizations.


3.1 Background 


The earliest research into business intelligence focused in on unstructured textual data, 
rather than numerical data. As early as 1958, computer science researchers like H.P. Luhn 
were particularly concerned with the extraction and classification of unstructured text. 
However, only since the turn of the century has the technology caught up with the research 
interest. In 2004, the SAS Institute developed the SAS Text Miner, which uses Singular Value 
Decomposition (SVD) to reduce a hyper-dimensional textual space into smaller dimensions 
for significantly more efficient machine-analysis. The mathematical and technological 
advances sparked by machine textual analysis prompted a number of businesses to research
applications, leading to the development of fields like sentiment analysis, voice of the 
customer mining, and call center optimization. The emergence of Big Data in the late 2000s 
led to a heightened interest in the applications of unstructured data analytics in contemporary 
fields such as predictive analytics and root cause analysis. 
3.2 Issues with terminology


The term is imprecise for several reasons: 

(1) Structure, while not formally defined, can still be implied. 

(2) Data with some form of structure may still be characterized as unstructured if its 
structure is not helpful for the processing task at hand. 

(3) Unstructured information might have some structure (semi-structured) or even be 
highly structured but in ways that are unanticipated or unannounced. 


3.3 Dealing with unstructured data 


Techniques such as data mining, natural language processing (NLP), and text analytics 
provide different methods to find patterns in, or otherwise interpret, this information. 
Common techniques for structuring text usually involve manual tagging with metadata or 
part-of-speech tagging for further text mining-based structuring. The Unstructured 
Information Management Architecture (UIMA) standard provided a common framework for 
processing this information to extract meaning and create structured data about the 
information. 

Software that creates machine-processable structure can utilize the linguistic, auditory,
and visual structure that exists in all forms of human communication. Algorithms can infer this
inherent structure from text, for instance, by examining word morphology, sentence syntax, 
and other small- and large-scale patterns. Unstructured information can then be enriched and 
tagged to address ambiguities and relevancy-based techniques then used to facilitate search 
and discovery. Examples of “unstructured data” may include books, journals, documents, 
metadata, health records, audio, video, analog data, images, files, and unstructured text such 
as the body of an e-mail message, Web page, or word-processor document. While the main 
content being conveyed does not have a defined structure, it generally comes packaged in 
objects (e.g. in files or documents) that themselves have structure and are thus a mix of 
structured and unstructured data, but collectively this is still referred to as “unstructured data”.
For example, an HTML web page is tagged, but HTML mark-up typically serves solely for 
rendering. It does not capture the meaning or function of tagged elements in ways that support 
automated processing of the information content of the page. XHTML tagging does allow 
machine processing of elements, although it typically does not capture or convey the semantic 
meaning of tagged terms. 
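A very small example of creating machine-processable structure from unstructured text is pattern matching. The sketch below, which uses Python's standard re module, pulls dates and amounts out of an invented e-mail body; the message and both patterns are assumptions made for illustration, and real text analytics relies on far richer techniques such as part-of-speech tagging.

```python
import re

# Hypothetical unstructured text, such as the body of an e-mail message.
message = (
    "Hi team, the invoice dated 2019-03-14 for 1,250.00 USD is still unpaid. "
    "Please call the customer before 2019-04-01."
)

# Simple patterns for ISO-style dates and amounts with two decimal places.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
AMOUNT_PATTERN = re.compile(r"\b\d{1,3}(?:,\d{3})*\.\d{2}\b")

# The extracted values form structured data about the unstructured message.
structured = {
    "dates": DATE_PATTERN.findall(message),
    "amounts": AMOUNT_PATTERN.findall(message),
    "word_count": len(message.split()),
}
```

The resulting dictionary can be stored in ordinary fields and queried, while the original message remains available for human readers.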

Since unstructured data commonly occurs in electronic documents, the use of a content 
or document management system which can categorize entire documents is often preferred 
over data transfer and manipulation from within the documents. Document management thus 
provides the means to convey structure onto document collections. 
Search engines have become popular tools for indexing and searching through such data,
especially text. 


New Words
characteristic [ˌkærɪktəˈrɪstɪk] n. feature, trait
    adj. distinctive, typical
model [ˈmɒdl] n. model, pattern
    vt. to imitate, to simulate
memory [ˈmeməri] n. storage, memory
commercially [kəˈmɜːʃəli] adv. commercially, in trade
huge [hjuːdʒ] adj. enormous, immense
supplement [ˈsʌplɪmənt] n. & vt. supplement
microfilm [ˈmaɪkrəʊfɪlm] n. microfilm
    v. to record on microfilm
interchange [ˌɪntəˈtʃeɪndʒ] vt. to exchange
committee [kəˈmɪti] n. committee
participate [pɑːˈtɪsɪpeɪt] vi. to take part, to share
conform [kənˈfɔːm] vt. to make consistent, to bring into compliance
    vi. to comply, to accord
tag [tæɡ] n. label, marker
    vt. to attach a label to
marker [ˈmɑːkə] n. mark, marker
separate [ˈseprət] adj. distinct, individual, single
    [ˈsepəreɪt] v. to divide, to set apart
hierarchy [ˈhaɪərɑːki] n. hierarchy, levels
entity [ˈentɪti] n. entity
increasingly [ɪnˈkriːsɪŋli] adv. more and more
medium [ˈmiːdiəm] n. medium, means
    adj. middle, intermediate
popularize [ˈpɒpjʊləraɪz] vt. to make popular
rigor [ˈrɪɡə] n. strictness, precision
inherently [ɪnˈhɪərəntli] adv. by nature, intrinsically
epitome [ɪˈpɪtəmi] n. typical example; summary
virtually [ˈvɜːtʃuəli] adv. in effect, practically
schema [ˈskiːmə] n. schema, outline
flexible [ˈfleksəbl] adj. flexible, pliable, adaptable
capable [ˈkeɪpəbl] adj. able, competent; having the possibility of
alternative [ɔːlˈtɜːnətɪv] n. one of two choices; an available option
    adj. offering a choice
breed [briːd] n. kind, type
natively [ˈneɪtɪvli] adv. natively, locally
mismatch [ˌmɪsˈmætʃ] vt. to match wrongly
    [ˈmɪsmætʃ] n. a bad match
nested [ˈnestɪd] adj. nested
simplify [ˈsɪmplɪfaɪ] vt. to make simple
restraint [rɪˈstreɪnt] n. restriction, control
irregularity [ɪˌreɡjʊˈlærɪti] n. irregularity, lack of regularity
ambiguity [ˌæmbɪˈɡjuːɪti] n. ambiguity, vagueness
potentially [pəˈtenʃəli] adv. potentially
background [ˈbækɡraʊnd] n. background, setting
hyperdimensional [ˌhaɪpədaɪˈmenʃənl] adj. hyperdimensional
spark [spɑːk] v. to trigger, to set off
emergence [ɪˈmɜːdʒəns] n. appearance, coming into being
terminology [ˌtɜːmɪˈnɒlədʒi] n. terminology
imprecise [ˌɪmprɪˈsaɪs] adj. inexact, not precise
implied [ɪmˈplaɪd] adj. implicit, hinted at
unanticipated [ˌʌnænˈtɪsɪpeɪtɪd] adj. unexpected
auditory [ˈɔːdɪtəri] adj. of the ear, of hearing
visual [ˈvɪʒuəl] adj. of sight, visual
infer [ɪnˈfɜː] vt. to deduce
morphology [mɔːˈfɒlədʒi] n. morphology, word formation
relevancy [ˈreləvənsi] n. relevance
capture [ˈkæptʃə] n. & vt. capture


Phrases


structured data: data that resides in fixed fields within a record or file
semi-structured data: data that carries tags or markers but does not follow a formal data model
unstructured data: data with no pre-defined data model or organization
fixed field: a field whose position and length in a record are fixed
data type: the kind of value a field holds
data input: the entry of data into a system
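fit into: to be suitable for; to fit within
filing cabinet: a cabinet with drawers for storing paper files
database management system: software for creating and managing databases
American National Standards Institute: the national standards body of the United States
semantic element: a unit of meaning within data
belong to: to be a member or part of
full-text document: a document stored with its complete text
object-oriented database: a database that stores data as objects
markup language: a language that annotates text with tags
suffer from: to be troubled by, to be affected by
data-centric application: an application built around data
garbage in, garbage out: poor input produces poor output
rule of thumb: a rough guideline based on experience
somewhere around: approximately
result in: to lead to, to cause
account for: to make up (a proportion of)
focused in on: concentrated on, paid attention to
be concerned with: to give attention to
predictive analytics: analysis that forecasts future outcomes
root cause analysis: a method for finding the underlying cause of a problem
may be characterized as: can be described as
at hand: within reach, nearby; about to happen
deal with: to handle, to be about, to arrange
part-of-speech tagging: labeling each word with its word class
sentence syntax: the grammatical structure of a sentence
analog data: data represented in continuous, non-digital form
serve for: to be used as, to act as
search engine: a tool for indexing and searching data
search through ...: to look through ... carefully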
Abbreviations
INCITS (InterNational Committee for Information Technology Standards)
EDI (Electronic Data Interchange)
OEM (Object Exchange Model)
SOAP (Simple Object Access Protocol)
JSON (JavaScript Object Notation)
REST (Representational State Transfer)
IDC (International Data Corporation)
SVD (Singular Value Decomposition)
NLP (Natural Language Processing)
UIMA (Unstructured Information Management Architecture)
HTML (HyperText Markup Language)
XHTML (Extensible HyperText Markup Language)


Exercises


[Ex. 5] Answer the following questions according to the text.

1. What does structured data refer to? 

2. What advantage does structured data have? 

3. How is structured data often managed? 

4. What is SQL? When did it become an International Organization for Standardization (ISO)
standard?

5. What is semi-structured data? 

6. What are the types of semi-structured data mentioned in the text? 

7. What are the disadvantages of using a semi-structured data format? 

8. What does unstructured data refer to? 

9. What techniques are used to deal with unstructured data? 

10. Why is the use of a content or document management system which can categorize entire 
documents often preferred over data transfer and manipulation from within the 
documents? 


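Reference Translation


Data Structures


A data structure is a specialized format for organizing and storing data. General data
structure types include the array, the file, the record, the table, the tree, and so on. Every
data structure is designed to organize data for a specific purpose so that it can be accessed
and worked with in appropriate ways. In computer programming, a data structure may also be
selected or designed to store data so that it can be worked on with various algorithms.

1. Array


(1) In data storage, an array is a method for storing information on multiple devices.

(2) In general, an array is a number of items arranged in some specified way (for example,
in a list or in a three-dimensional table).

(3) In computer programming languages, an array is a group of objects with the same
attributes that can be addressed individually, using such techniques as subscripting.

(4) In random access memory, an array is the arrangement of memory cells.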


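2. File


(1) In data processing, a file is a collection of related records. For example, you might put
the records for each of your customers in a file. In turn, each record consists of fields for
individual data items, such as customer name, customer number, customer address, and so
forth. By providing the same kind of information in the same fields of each record (so that
all records are consistent), a file can easily be accessed and processed by a computer
program. With the arrival of the database, these terms have become less important, and the
emphasis is on tables that collect record and field data in a certain way. In mainframe
systems, the term data set is generally synonymous with file but implies a specific form of
organization recognized by a particular access method. Depending on the operating system,
files (and data sets) are contained within a catalog, directory, or folder.

(2) In any computer system, but especially in personal computers, a file is an entity of data
available to system users (including the system itself and its application programs) that can
be manipulated as an entity (for example, moved from one file directory to another). A file
must have a unique name within its own directory. Some operating systems and applications
describe files of a given format by giving them a particular file name suffix. The file name
suffix is also known as a file name extension. For example, a program or executable file is
sometimes given or required to have an ".exe" suffix. In general, the suffix tends to describe
the format of the file as fully as it can within the number of characters the operating system
allows.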


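3. Record


(1) In computer data processing, a record is a collection of data items arranged for
processing by a program. Multiple records can make up a file or data set. The organization of
data in the record is usually prescribed by the programming language that defines the
record's organization and/or by the application that processes it. Typically, records can be
of fixed length or of variable length, with the length information contained within the
record.

(2) In a database, a record (sometimes called a row) is a group of fields within a table that
are relevant to a specific entity. For example, in a table of customer contact information, a
row would likely contain fields such as ID number, name, street address, city, telephone
number, and so on.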


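4. Table
In computer programming, a table is a data structure used to organize information, just as
it is on paper. There are many different types of computer-related tables, which work in a
number of different ways. The more common types are listed below.

(1) In data processing, a table (also called an array) is an organized grouping of fields.
Tables may store relatively permanent data or may be frequently updated. For example, a table
contained in a disk volume is updated when sectors are being written.

(2) In a relational database, a table (sometimes called a file) organizes the information
about a single topic into rows and columns. For example, a database for a business would
typically contain a table for customer information, which would store customers' account
numbers, addresses, phone numbers, and so on as a series of columns. Each single piece of data
(such as the account number) is a field in the table. A column consists of all the entries in
a single field, such as the telephone numbers of all the customers. Fields, in turn, are
organized as records, which are complete sets of information (such as the set of information
about a particular customer), each of which comprises a row. The process of normalization
determines how data will be most effectively organized into tables.

(3) A decision table (often called a truth table), which can be computer-based or simply
drawn up on paper, contains a list of decisions and the criteria on which they are based. All
possible situations for decisions should be listed, and the action to take in each situation
should be specified. A simple example: for a traffic intersection, the decision to proceed
might be expressed as yes or no, and the criteria might be that the light is red or that the
light is green.

A decision table can be inserted into a computer program so that different decisions are made
in different situations. Changes to the decision table are reflected in the program.

(4) An HTML table is used to organize Web page elements spatially or to create a structure
for data that is best displayed in tabular form, such as lists or specifications.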


Unit 6 


Text A 
Basic Concepts of Database 
1. Database 


A database is a collection of information that is organized so that it can easily be 
accessed, managed and updated. Databases can be classified according to types of content: 
bibliographic, full-text, numeric and images. 

In computing, databases are sometimes classified according to their organizational 
approach. The most prevalent approach is the relational database, a tabular database in which 
data is defined so that it can be reorganized and accessed in a number of different ways. A 
distributed database is one that can be dispersed or replicated among different points in a 
network. An object-oriented programming database is one that is congruent with the data 
defined in object classes and subclasses. 

Computer databases typically contain aggregations of data records or files, such as sales 
transactions, product catalogs and inventories, and customer profiles. Typically, a database 
manager provides users the capabilities of controlling read/write access, specifying report 
generation and analyzing usage. Databases and database managers are prevalent in large 
mainframe systems, but are also present in smaller distributed workstation and mid-range 
systems such as the AS/400 and on personal computers. 
2. Relational Database


A relational database is a collection of data items organized as a set of formally-described
tables from which data can be accessed or reassembled in many different ways
without having to reorganize the database tables. The relational database was invented by E. F. 
Codd at IBM in 1970. 

The standard user and application program interface to a relational database is the 
structured query language (SQL). SQL statements are used both for interactive queries for 
information from a relational database and for gathering data for reports. 

In addition to being relatively easy to create and access, a relational database has the 
important advantage of being easy to extend. After the original database creation, a new data 
category can be added without requiring that all existing applications be modified. 

A relational database is a set of tables containing data fitted into predefined categories. 
Each table (which is sometimes called a relation) contains one or more data categories in 
columns. Each row contains a unique instance of data for the categories defined by the 
columns. For example, a typical business order entry database would include a table that 
described a customer with columns for name, address, phone number, and so forth. Another 
table would describe an order: product, customer, date, sales price, and so forth. A user of the 
database could obtain a view of the database that fitted the user’s needs. For example, a 
branch office manager might like a view or report on all customers that had bought products 
after a certain date. A financial services manager in the same company could, from the same 
tables, obtain a report on accounts that needed to be paid. 
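The order entry example above can be sketched with Python's standard sqlite3 module. This is an illustrative sketch only: the table names (customer, orders), the view name recent_buyers, and the sample rows are hypothetical, and SQLite stands in for whichever relational database is actually used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

conn.executescript("""
CREATE TABLE customer (
    id      INTEGER PRIMARY KEY,
    name    TEXT,
    address TEXT,
    phone   TEXT
);
CREATE TABLE orders (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(id),
    product     TEXT,
    order_date  TEXT,
    sales_price REAL
);
INSERT INTO customer VALUES (1, 'Ann', '1 High St', '555-0100'),
                            (2, 'Bob', '2 Low Rd',  '555-0101');
INSERT INTO orders VALUES (1, 1, 'Widget', '2019-01-15', 9.5),
                          (2, 2, 'Gadget', '2019-06-30', 20.0);

-- A view for a branch manager: customers who bought after a certain date.
CREATE VIEW recent_buyers AS
    SELECT c.name, o.product, o.order_date
    FROM customer AS c JOIN orders AS o ON o.customer_id = c.id
    WHERE o.order_date > '2019-03-01';
""")

recent = conn.execute("SELECT name, product FROM recent_buyers").fetchall()
```

The same two tables could serve a different view, such as a report on unpaid accounts, without reorganizing the tables themselves.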

When creating a relational database, you can define the domain of possible values in a 
data column and further constraints that may apply to that data value. For example, a domain 
of possible customers could allow up to ten possible customer names but be constrained in 
one table to allowing only three of these customer names to be specifiable. 
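One way to express such a constraint, assuming SQLite as the database, is a CHECK clause that limits a column to a subset of the wider domain. The table name key_account and the customer names below are invented for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Only three customer names from the wider domain are allowed in this table.
conn.execute("""
CREATE TABLE key_account (
    customer_name TEXT NOT NULL
        CHECK (customer_name IN ('Acme', 'Globex', 'Initech'))
)
""")

conn.execute("INSERT INTO key_account VALUES ('Acme')")  # inside the constraint

try:
    conn.execute("INSERT INTO key_account VALUES ('Umbrella')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True  # the database refuses a value outside the allowed set

stored = [row[0] for row in conn.execute("SELECT customer_name FROM key_account")]
```

The constraint is enforced by the database itself, so every application that writes to the table is held to the same rule.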

The definition of a relational database results in a table of metadata or formal 
descriptions of the tables, columns, domains, and constraints. 
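Most relational products let you query this metadata directly. As a hedged illustration, SQLite exposes it through the sqlite_master table and the table_info pragma; other databases use different catalogs, and the customer table here is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL, phone TEXT)"
)

# sqlite_master is SQLite's own table of metadata about tables, views and indexes.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]

# PRAGMA table_info returns one row per column: (cid, name, type, notnull, default, pk).
columns = [(row[1], row[2], bool(row[3]))
           for row in conn.execute("PRAGMA table_info(customer)")]
```

Each tuple in columns gives a column's name, its declared type, and whether it carries a NOT NULL constraint.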


3. SQL 


SQL (Structured Query Language) is a standard language for making interactive queries 
from a database and updating a database such as IBM’s DB2, Microsoft’s Access and 
database products from Oracle, Sybase and Computer Associates. Although SQL is both an 
ANSI and an ISO standard, many database products support SQL with proprietary extensions 
to the standard language. Queries take the form of a command language that lets you select, 
insert, update, find out the location of data and so forth. There is also a programming 
interface.
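Both sides of SQL, the command language and the programming interface, can be seen in a short sketch. Python's standard sqlite3 module plays the role of the programming interface here; the product table and its rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT, price REAL)")

# INSERT: add rows, passing values as parameters through the programming interface.
conn.execute("INSERT INTO product (name, price) VALUES (?, ?)", ("Widget", 9.5))
conn.execute("INSERT INTO product (name, price) VALUES (?, ?)", ("Gadget", 20.0))

# UPDATE: change an existing row.
conn.execute("UPDATE product SET price = ? WHERE name = ?", (10.0, "Widget"))

# SELECT: query the data back.
rows = conn.execute("SELECT name, price FROM product ORDER BY name").fetchall()
```

The SQL statements themselves are plain text, so the same commands could be typed into an interactive query tool instead of being issued from a program.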


4. Database Management System 


A database management system (DBMS), sometimes just called a database manager, is a 
program that lets one or more computer users create and access data in a database. The DBMS 
manages user requests (and requests from other programs) so that users and other programs 
are free from having to understand where the data is physically located on storage media and,
in a multi-user system, who else may also be accessing the data. In handling user requests,
the DBMS ensures the integrity of the data (that is, making sure it continues to be accessible 
and is consistently organized as intended) and security (making sure only those with access 
privileges can access the data). The most typical DBMS is a relational database management 
system (RDBMS). A standard user and program interface is the Structured Query Language 
(SQL). A newer kind of DBMS is the object-oriented database management system (ODBMS). 

A DBMS can be thought of as a file manager that manages data in databases rather than 
files in file systems. In IBM's mainframe operating systems, the non-relational data managers 
were (and are, because these legacy application systems are still used) known as access 
methods. 

A DBMS is usually an inherent part of a database product. On PCs, Microsoft's Access 
is a popular example of a single- or small-group user DBMS. Microsoft's SQL Server is an 
example of a DBMS that serves database requests from multiple (client) users. Other popular 
DBMSs (these are all RDBMSs, by the way) are IBM's DB2, Oracle's line of database 
management products, and Sybase's products. 

IBM's Information Management System (IMS) was one of the first DBMSs. A DBMS 
may be used by or combined with transaction managers, such as IBM's Customer Information 
Control System (CICS). 


5. Distributed Database 


A distributed database is a database in which portions of the database are stored on 
multiple computers within a network. Users have access to the portion of the database at their 
location so that they can access the data relevant to their tasks without interfering with the 
work of others. 


6. DDBMS 


A DDBMS (distributed database management system) is a centralized application that 
manages a distributed database as if it were all stored on the same computer. The DDBMS
synchronizes all the data periodically, and in cases where multiple users must access the same 
data, ensures that updates and deletes performed on the data at one location will be 
automatically reflected in the data stored elsewhere. 


7. Field 


In a database table, a field is a data structure for a single piece of data. Fields are 
organized into records, which contain all the information within the table relevant to a specific 
entity. For example, in a table called customer contact information, telephone number would 
likely be a field in a row that would also contain other fields such as street address and city. 
The records make up the table rows and the fields make up the columns. 


8. Record 


In a database, a record (sometimes called a row) is a group of fields within a table that 
are relevant to a specific entity. For example, in a table called customer contact information, a 
row would likely contain fields such as: ID number, name, street address, city, telephone 
number and so on. 


9. Table 


In a relational database, a table (sometimes called a file) organizes the information about 
a single topic into rows and columns. For example, a database for a business would typically 
contain a table for customer information, which would store customers’ account numbers, 
addresses, phone numbers, and so on as a series of columns. Each single piece of data (such 
as the account number) is a field in the table. A column consists of all the entries in a single 
field, such as the telephone numbers of all the customers. Fields, in turn, are organized as 
records, which are complete sets of information (such as the set of information about a 
particular customer), each of which comprises a row. The process of normalization determines 
how data will be most effectively organized into tables. 
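The relationship between fields, records, rows and columns described in sections 7 to 9 can be sketched with plain Python structures. The field names and sample values below are hypothetical.

```python
# Each dictionary is one record (row) of a customer information table.
records = [
    {"account_number": "A-100", "address": "1 High St", "phone": "555-0100"},
    {"account_number": "A-101", "address": "2 Low Rd", "phone": "555-0101"},
]

# The field names shared by every record define the table's columns.
fields = list(records[0])

# A column consists of all the entries in a single field.
phone_column = [record["phone"] for record in records]

# A row is the complete set of fields for one particular customer.
first_row = tuple(records[0][field] for field in fields)
```

Reading across a record gives a row, while reading the same field down all the records gives a column.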


New Words
database [ˈdeɪtəbeɪs] n. database
organize [ˈɔːɡənaɪz] v. to arrange, to organize
classify [ˈklæsɪfaɪ] vt. to sort into classes, to grade
bibliographic [ˌbɪbliəˈɡræfɪk] adj. bibliographic, relating to catalogs of works
approach [əˈprəʊtʃ] n. method, way, approach
tabular [ˈtæbjʊlə] adj. arranged in a table, tabular
distributed [dɪˈstrɪbjuːtɪd] adj. distributed
disperse [dɪˈspɜːs] v. to scatter, to spread out
congruent [ˈkɒŋɡruənt] adj. (used with "with") consistent, in agreement
aggregation [ˌæɡrɪˈɡeɪʃn] n. collection, gathering
catalog [ˈkætəlɒɡ] n. catalog, list
    vt. to catalog
capability [ˌkeɪpəˈbɪləti] n. (practical) ability, performance, capacity
analyze [ˈænəlaɪz] vt. to analyze, to break down
prevalent [ˈprevələnt] adj. widespread, common
set [set] n. set, collection
reorganize [riˈɔːɡənaɪz] vt. to restructure, to rearrange
query [ˈkwɪəri] v. to ask, to query
extend [ɪkˈstend] v. to expand, to enlarge, to lengthen
instance [ˈɪnstəns] n. example, instance
view [vjuː] n. view
domain [dəʊˈmeɪn] n. domain, range
constraint [kənˈstreɪnt] n. restriction, limitation
specifiable [ˈspesɪfaɪəbl] adj. that can be specified, stated in detail, or listed
Oracle [ˈɒrəkl] n. Oracle Corporation, a US company that mainly produces database products
command [kəˈmɑːnd] n. & v. command
insert [ɪnˈsɜːt] vt. to insert
ensure [ɪnˈʃʊə] vt. to make sure
privilege [ˈprɪvɪlɪdʒ] n. privilege, special right
inherent [ɪnˈhɪərənt] adj. built-in, intrinsic
client [ˈklaɪənt] n. customer, client
centralize [ˈsentrəlaɪz] vt. to concentrate, to bring under central control
synchronize [ˈsɪŋkrənaɪz] v. to synchronize
periodically [ˌpɪəriˈɒdɪkli] adv. at regular intervals
delete [dɪˈliːt] vt. to delete
automatically [ˌɔːtəˈmætɪkli] adv. automatically
reflect [rɪˈflekt] vt. to reflect, to show
field [fiːld] n. field
topic [ˈtɒpɪk] n. topic, subject
series [ˈsɪəriːz] n. series, sequence
complete [kəmˈpliːt] adj. whole, entire, finished
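Phrases

tabular database: a database organized as tables

distributed database: a database spread across several computers in a network

customer profile: a summary of information about a customer

find out: to discover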


Abbreviations


SQL (Structured Query Language)

IBM (International Business Machines Corporation)

ANSI (American National Standards Institute)

ISO (International Organization for Standardization)

DBMS (Database Management System)

RDBMS (Relational Database Management System)

ODBMS (Object-Oriented Database Management System)

IMS (Information Management System)

CICS (Customer Information Control System)

DDBMS (Distributed Database Management System)

Notes


[1] A relational database is a collection of data items organized as a set of formally-described 
tables from which data can be accessed or reassembled in many different ways without 
having to reorganize the database tables. 
In this sentence, "organized as a set of formally-described tables" is a past participle
phrase used attributively to modify "data items"; it can be expanded into a relative
clause: "which are organized as a set of formally-described tables". The clause "from
which data can be accessed or reassembled in many different ways without having to
reorganize the database tables" is a relative clause with a fronted preposition, and it
modifies "formally-described tables".

[2] A database management system (DBMS), sometimes just called a database manager, is a
program that lets one or more computer users create and access data in a database.
In this sentence, "sometimes just called a database manager" adds further information
about "A database management system (DBMS)". The clause "that lets one or more
computer users create and access data in a database" is a relative clause modifying "a
program". Within that clause, "create and access data in a database" is a bare infinitive
phrase (an infinitive without "to") that serves as the complement of the object "one or
more computer users".

In English, an infinitive used as an object complement after verbs such as make, let,
have, see, hear, watch, notice, and feel drops "to". When the sentence is made passive
and the object complement becomes a subject complement, however, "to" cannot be
omitted. For example:

His boss often makes him work on weekends without extra pay.

Let each man decide for himself.

Someone was heard to come up the stairs.

[3] A DDBMS (distributed database management system) is a centralized application that 
manages a distributed database as if it were all stored on the same computer. 

In this sentence, "that manages a distributed database as if it were all stored on the
same computer" is a relative clause modifying "a centralized application". Within that
clause, "as if it were all stored on the same computer" is an adverbial clause of manner.

In English, adverbial clauses of manner introduced by "as if" and "as though" generally
take the subjunctive mood. For example:

He talks as if he were a know-it-all.

[4] The DDBMS synchronizes all the data periodically, and in cases where multiple users
must access the same data, ensures that updates and deletes performed on the data at one
location will be automatically reflected in the data stored elsewhere.
In this sentence, "in cases where multiple users must access the same data" serves as an
adverbial of condition. The clause "that updates and deletes performed on the data at one
location will be automatically reflected in the data stored elsewhere" is an object clause,
the object of "ensures". Within that clause, "performed on the data at one location" is a
past participle phrase used attributively to modify "updates and deletes".

[5] For example, in a table called customer contact information, telephone number would 
likely be a field in a row that would also contain other fields such as street address and 
city. 

In this sentence, "called customer contact information" is a past participle phrase used
attributively to modify "a table". The clause "that would also contain other fields such as
street address and city" is a relative clause modifying "a row".
Exercises


[Ex. 1] Decide whether the following statements are true or false according to the text.

1. A database is a collection of organized information. 

2. A relational database is a tabular database in which data is defined so that it can be 
reorganized and accessed in a number of different ways. 

3. A distributed database is one that can be dispersed or replicated at certain points in a network.

4. An object-oriented programming database is one that is congruent with the data defined in 
object classes and subclasses. 

5. Databases and database managers are prevalent only in large mainframe systems. 

6. The most typical DBMS is a distributed database management system. 

7. A DBMS can be thought of as a file manager that manages data in databases. 

8. The records make up the columns and the fields make up the table rows. 


[Ex. 2] Fill in the blanks according to the text.

1. According to types of content, databases can be classified into ________, ________,
________ and images.

2. The relational database was invented by ________ in ________.

3. SQL stands for ________. It is a standard language for making interactive
queries from a database and updating a database.

4. SQL statements are used both for ________ from a relational
database and ________.

5. Queries take the form of a command language that lets you ________, ________,
________, find out the location of data and so forth. There is also ________.

6. A database management system (DBMS), sometimes just called ________, is a
program that lets one or more computer users ________ in a
database.

7. A standard user and program interface is ________. A newer kind of
DBMS is ________.

8. A DDBMS (distributed database management system) is a centralized application that
manages ________ as if it were all stored on the same computer.

9. In a database table, a field is ________. Fields are organized into ________.

10. In a database, a record is a group of fields within ________ that are relevant to
________.

[Ex. 3] For each description below, choose the term with the closest meaning from the list that follows.

1. The type of computer processing where the user of the system communicates directly with 
the system to input data and instructions and receive output. 

2. The boundary between two systems; a shared boundary between two systems. 

3. The capability of having two or more jobs in the computer at the same time. Execution of 
the program is interleaved so that in a time interval each job will have been (partly) 
processed. Processing is not simultaneous. 

4. A pictorial representation of processes and procedures for operation on data. A diagram that 
describes documents, procedures, processes, and equipment used in processing data in a 
specific application. 

5. Performing tests and checks on input to ensure that the input operation is legal and that the 
input itself is correct. Pertaining to a wide variety of tests that can be applied to ensure the
correctness of data being input to a computer system. 


Answer choices:
A. decision table            B. environment
C. flowchart                 D. input/output system
E. input validation          F. integrated circuit
G. interactive computing     H. interface
I. multiprogramming


[Ex. 4] Choose the best answer for each blank.

__(1)__ analysis emphasizes the drawing of pictorial system models to document and
validate both existing and/or proposed systems. Ultimately, the system models become
the __(2)__ for designing and constructing an improved system. __(3)__ is such a
technique. The emphasis in this technique is process-centered. Systems analysts draw a series
of process models called __(4)__. __(5)__ is another such technique that integrates data
and process concerns into constructs called objects.


Answer choices:
1. A. Prototyping    B. Accelerated    C. Model-driven    D. Iterative
2. A. image    B. picture    C. layout    D. blueprint
3. A. Structured analysis    B. Information Engineering
   C. Discovery Prototyping    D. Object-Oriented analysis
4. A. PERT    B. DFD    C. ERD    D. UML
5. A. Structured analysis    B. Information Engineering
   C. Discovery Prototyping    D. Object-Oriented analysis
Text B


How Cloud Storage Works 


Comedian George Carlin has a routine in which he talks about how humans seem to 
spend their lives accumulating “stuff.” Once they’ve gathered enough stuff, they have to find 
places to store all of it. If Carlin were to update that routine today, he could make the same 
observation about computer information. It seems that everyone with a computer spends a lot 
of time acquiring data and then trying to find a way to store it. 

For some computer owners, finding enough storage space to hold all the data they’ve 
acquired is a real challenge. Some people invest in larger hard drives. Others prefer external 
storage devices like thumb drives or compact discs. Desperate computer owners might delete 
entire folders worth of old files in order to make space for new information. But some are 
choosing to rely on a growing trend: cloud storage. 

While cloud storage sounds like it has something to do with weather fronts and storm
systems, it really refers to saving data to an off-site storage system maintained by a third party.
Instead of storing information to your computer’s hard drive or other local storage device, you 
save it to a remote database. The Internet provides the connection between your computer and 
the database. 

On the surface, cloud storage has several advantages over traditional data storage. For 
example, if you store your data on a cloud storage system, you'll be able to get to that data 
from any location that has Internet access. You wouldn’t need to carry around a physical 
storage device or use the same computer to save and retrieve your information. With the right 
storage system, you could even allow other people to access the data, turning a personal 
project into a collaborative effort. 


1. Cloud Storage Basics 


There are hundreds of different cloud storage systems. Some have a very specific focus, 
such as storing Web e-mail messages or digital pictures. Others are available to store all forms 
of digital data. Some cloud storage systems are small operations, while others are so large that 
the physical equipment can fill up an entire warehouse. The facilities that house cloud storage 
systems are called data centers. 

At its most basic level, a cloud storage system needs just one data server connected to 
the Internet. A client (e.g., a computer user subscribing to a cloud storage service) sends
copies of files over the Internet to the data server, which then records the information. When 
the client wishes to retrieve the information, he or she accesses the data server through a
Web-based interface. The server then either sends the files back to the client or allows the 
client to access and manipulate the files on the server itself. 

Cloud storage systems generally rely on hundreds of data servers. Because computers 
occasionally require maintenance or repair, it’s important to store the same information on 
multiple machines. This is called redundancy. Without redundancy, a cloud storage system 
couldn’t assure clients that they could access their information at any given time. Most
systems store the same data on servers that use different power supplies. That way, clients can 
access their data even if one power supply fails. 
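
The upload, retrieval, and redundancy described above can be sketched in a few lines. This is an illustrative model only, not how any particular provider works: the classes `DataServer` and `RedundantStore`, the server names, and the file contents are all hypothetical, and the "servers" are simple in-memory objects rather than machines on a network.

```python
class DataServer:
    """Hypothetical in-memory stand-in for one storage server."""

    def __init__(self, name):
        self.name = name
        self.online = True
        self.files = {}

    def store(self, filename, data):
        if not self.online:
            raise ConnectionError(f"{self.name} is offline")
        self.files[filename] = data

    def fetch(self, filename):
        if not self.online:
            raise ConnectionError(f"{self.name} is offline")
        return self.files[filename]


class RedundantStore:
    """Keeps a copy of every file on each server (redundancy)."""

    def __init__(self, servers):
        self.servers = servers

    def upload(self, filename, data):
        # The client sends a copy of the file to every server.
        for server in self.servers:
            server.store(filename, data)

    def download(self, filename):
        # Try each server in turn and return the first copy that is reachable.
        for server in self.servers:
            try:
                return server.fetch(filename)
            except (ConnectionError, KeyError):
                continue
        raise LookupError(f"no reachable copy of {filename}")


servers = [DataServer("server-a"), DataServer("server-b")]
store = RedundantStore(servers)
store.upload("notes.txt", b"quarterly figures")

servers[0].online = False  # simulate one machine losing power
recovered = store.download("notes.txt")
print(recovered)
```

Because the second server still holds a copy, the download succeeds even though the first server is down. If every server were offline, `download` would raise `LookupError`.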

Not all cloud storage clients are worried about running out of storage space. They use 
cloud storage as a way to create backups of data. If something happens to the client’s 
computer system, the data survives off-site. It’s a digital-age variation of “don’t put all your 
eggs in one basket.” 


2. Examples of Cloud Storage 


There are hundreds of cloud storage providers on the Web, and their numbers seem to 
increase every day. Not only are there a lot of companies competing to provide storage, but 
also the amount of storage each company offers to clients seems to grow regularly. 

You're probably familiar with several providers of cloud storage services, though you 
might not think of them in that way. Here are a few well-known companies that offer some 
form of cloud storage: 

* Google Docs allows users to upload documents, spreadsheets and presentations 
to Google’s data servers. Users can edit files using a Google application. Users can 
also publish documents so that other people can read them or even make edits, which 
means Google Docs is also an example of cloud computing. 

* Web e-mail providers like Gmail, Hotmail and Yahoo! Mail store e-mail messages on 
their own servers. Users can access their e-mail from computers and other devices 
connected to the Internet. 

* Sites like Flickr and Picasa host millions of digital photographs. Their users create 
online photo albums by uploading pictures directly to the services’ servers. 

* YouTube hosts millions of user-uploaded video files. 

* Web site hosting companies like StartLogic, Hostmonster and GoDaddy store the files 
and data for client Web sites. 

* Social networking sites like Facebook and MySpace allow members to post pictures 
and other content. All of that content is stored on the respective site's servers. 

* Services like Xdrive, MediaMax and Strongspace offer storage space for any kind of 
digital data.

Some of the services listed above are free. Others charge a flat fee for a certain amount 
of storage, and still others have a sliding scale depending on what the client needs. In general, 
the price for online storage has fallen as more companies have entered the industry. Even 
many of the companies that charge for digital storage offer at least a certain amount for free. 


3. Concerns about Cloud Storage 


The two biggest concerns about cloud storage are reliability and security. Clients aren’t 
likely to entrust their data to another company without a guarantee that they'll be able to 
access their information whenever they want and no one else will be able to get at it. 

To secure data, most systems use a combination of techniques, including: 

* Encryption, which means they use a complex algorithm to encode information. To
decode the encrypted files, a user needs the encryption key. While it’s possible to 
crack encrypted information, most hackers don't have access to the amount 
of computer processing power they would need to decrypt information. 

* Authentication processes, which require users to create a user name and password.

* Authorization practices—the client lists the people who are authorized to access 
information stored on the cloud system. Many corporations have multiple levels of 
authorization. For example, a front-line employee might have very limited access to 
data stored on a cloud system, while the head of human resources might have 
extensive access to files. 
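
Two of the techniques in the list above, authentication and authorization, can be sketched with Python's standard library. This is a simplified illustration under stated assumptions: the class `AccessControl`, the user names, passwords, and file names are hypothetical, and encryption is left out because the standard library has no general-purpose file encryption routine.

```python
import hashlib
import hmac
import os


def hash_password(password, salt):
    # PBKDF2 stretches the password so the stored value is costly to brute-force.
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)


class AccessControl:
    def __init__(self):
        self.users = {}        # user name -> (salt, password hash)
        self.permissions = {}  # user name -> set of readable file names

    def register(self, user, password, readable_files):
        salt = os.urandom(16)
        self.users[user] = (salt, hash_password(password, salt))
        self.permissions[user] = set(readable_files)

    def authenticate(self, user, password):
        """Authentication: is this really the named user?"""
        if user not in self.users:
            return False
        salt, stored = self.users[user]
        return hmac.compare_digest(stored, hash_password(password, salt))

    def can_read(self, user, password, filename):
        """Authorization: is this authenticated user allowed to read the file?"""
        return self.authenticate(user, password) and filename in self.permissions[user]


acl = AccessControl()
acl.register("clerk", "pw-clerk", ["schedule.txt"])
acl.register("hr_head", "pw-hr", ["schedule.txt", "salaries.txt"])

print(acl.can_read("hr_head", "pw-hr", "salaries.txt"))
print(acl.can_read("clerk", "pw-clerk", "salaries.txt"))
```

The front-line employee and the head of human resources both log in successfully, but only the second is authorized to read the salary file, which mirrors the multiple levels of authorization mentioned in the text.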

Even with these protective measures in place, many people worry that data saved on a 
remote storage system is vulnerable. There's always the possibility that a hacker will find an 
electronic back door and access data. Hackers could also attempt to steal the physical 
machines on which data are stored. A disgruntled employee could alter or destroy data using 
his or her authenticated user name and password. Cloud storage companies invest a lot of 
money in security measures in order to limit the possibility of data theft or corruption. 

The other big concern, reliability, is just as important as security. An unstable cloud 
storage system is a liability. No one wants to save data to a failure-prone system, nor do they 
want to trust a company that isn't financially stable. While most cloud storage systems try to 
address this concern through redundancy techniques, there's still the possibility that an entire 
system could crash and leave clients with no way to access their saved data. 

Cloud storage companies live and die by their reputations. It's in each company's best 
interests to provide the most secure and reliable service possible. If a company can't meet 
these basic client expectations, it doesn't have much of a chance—there are too many other 
options available on the market. 
New Words
routine [ru:'ti:n] n. routine, regular procedure
adj. routine, regular
accumulate [ə'kju:mjuleit] v. to accumulate, to pile up
stuff [stʌf] n. material, things
v. to stuff, to fill
observation [ˌɔbzə:'veiʃən] n. observation, remark; observational data (or report)
desperate ['despərit] adj. desperate, reckless, hopeless
folder ['fəuldə] n. folder
e-mail ['i:meil] n. electronic mail
form [fɔ:m] n. form, kind
v. to form, to constitute
equipment [i'kwipmənt] n. equipment, apparatus
house [hauz] v. to provide a place for; to store, to accommodate
survive [sə'vaiv] v. to survive, to outlive
variation [ˌvɛəri'eiʃən] n. change, variation, variant
regularly ['regjuləli] adv. regularly, in an orderly way
upload [ʌp'ləud] vt. & n. upload
entrust [in'trʌst] vt. to entrust
crack [kræk] v. to crack, to break
hacker ['hækə] n. hacker
decrypt [di:'kript] v. to decrypt
password ['pɑ:swə:d] n. password
authorization [ˌɔ:θərai'zeiʃən] n. authorization, permission
front-line ['frʌnt-lain] adj. front-line
extensive [iks'tensiv] adj. extensive, wide-ranging
vulnerable ['vʌlnərəbl] adj. vulnerable, open to attack
disgruntled [dis'grʌntld] adj. dissatisfied, displeased
address [ə'dres] vt. to address, to deal with
crash [kræʃ] n. & v. crash, failure
reputation [ˌrepju(:)'teiʃən] n. reputation, good name
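Phrases
cloud storage: storing data on remote servers reached over a network
storage space: capacity available for storing data
invest in: to put money into
hard drive: hard disk drive
external storage device: a storage device outside the computer
weather front: a boundary between air masses; weather conditions
storm system: a system of storms
thumb drive: USB flash drive
compact disc: optical disc (CD)
have something to do with...: to be somewhat related to
off-site storage: storage at another location, remote storage
third party: a party other than the two principally involved
local storage device: a storage device attached to the user's own computer
on the surface: superficially, at first glance
physical storage device: a tangible storage device
fill up: to fill completely
subscribe to: to sign up and pay for (a service)
power supply: source of electrical power
flat fee: fixed charge
be worried about: to be anxious about
be familiar with: to know well
social networking site: a website for building social networks
sliding scale: pricing that rises or falls in proportion to usage
multiple levels: several tiers
protective measures: safeguards
back door: a hidden way into a system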


Exercises


[Ex. 5] Answer the following questions according to the text.
1. What is a real challenge for some computer owners? 
2. What might desperate computer owners do in order to make space for new information? 
3. What does cloud storage really refer to? 
4. What does a cloud storage system need at its most basic level? 


5. Why is it important to store the same information on multiple machines? 

6. What do cloud storage clients use cloud storage as? 

7. What does Google Docs allow users to do?

8. What do social networking sites like Facebook and MySpace allow members to do? 
Where is all that content stored? 
9. What are the two biggest concerns mentioned in the passage about cloud storage?
10. What do most systems do to secure data? 


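Reference Translation
Basic Database Concepts


1. Database


A database is a collection of information organized so that it can easily be accessed, managed,
and updated. By content, databases can be classified into the following types: bibliographic,
full-text, numeric, and image databases.

In computing, databases are sometimes also classified by their organizational approach. The
most prevalent approach today is the relational database, a tabular database in which data is
defined so that it can be reorganized and accessed in a number of different ways. A distributed
database is one that is dispersed or replicated among many different points in a network. An
object-oriented programming database is one that suits data defined in object classes and
subclasses.

Computer databases typically contain collections of data records or files, such as sales
transactions, product catalogs and inventories, and customer profiles. Typically, a database
manager gives users the ability to control read/write access, generate reports, and analyze
usage. Databases and database managers are very common in mainframe systems, but they also
appear in smaller distributed workstations and mid-range systems such as the AS/400, and on
personal computers.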


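2. Relational Database


A relational database is a collection of data items organized as a set of formally described
tables, from which data can be accessed or reassembled in many different ways without having to
reorganize the database tables. The relational database was invented by E. F. Codd at IBM in
1970.

The standard user and application program interface to a relational database is the Structured
Query Language (SQL). SQL statements are used both for interactive queries for information from
a relational database and for gathering data for reports.

In addition to being relatively easy to create and access, a relational database has the
important advantage of being easy to extend. After the original database has been created, a new
data category can be added without requiring that all existing applications be modified.

A relational database is a set of tables containing data fitted into predefined categories. Each
table (sometimes called a relation) contains one or more data categories in columns. Each row
contains a unique instance of data for the categories defined by the columns. For example, a
typical business order entry database would include a table that describes a customer, with
columns for name, address, phone number, and so forth. Another table would describe an order:
product, customer, date, sales price, and so forth. A user of the database can obtain a view of
the database that fits his or her needs. A branch office manager, for example, might want a view
or report on all customers who bought products after a certain date. A financial services
manager in the same company could obtain, from the same tables, a report on accounts that need
to be paid.

When creating a relational database, you can define the domain of possible values in a data
column and further constraints that may apply to those values. For example, a domain of possible
customers could allow up to ten customer names but be constrained in one table to allowing only
three of them to be listed.

The definition of a relational database results in a table of metadata, or formal descriptions
of the tables, columns, domains, and constraints.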


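3. SQL


SQL (Structured Query Language) is a standard language for making interactive queries from and
updating a database such as IBM DB2, Microsoft Access, and database products from Oracle,
Sybase, and Computer Associates. Although SQL is both an ANSI and an ISO standard, many database
products support SQL with proprietary extensions to the standard language. Queries take the form
of a command language that lets you select, insert, update, find out the location of data, and
so forth. There is also a programming interface.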


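4. Database Management System


A database management system (DBMS), sometimes just called a database manager, is a program
that lets one or more computer users create and access data in a database. The DBMS manages user
requests (and requests from other programs) so that users and other programs do not need to know
where the data is physically located on storage media and, in a multi-user system, who else may
also be accessing the data. In handling user requests, the DBMS ensures the integrity of the
data (that is, making sure it continues to be accessible and is consistently organized as
intended) and security (making sure only those with access privileges can access the data). The
most typical DBMS is a relational database management system (RDBMS). A standard user and
program interface is the Structured Query Language (SQL). A newer kind of DBMS is the
object-oriented database management system (ODBMS).

A DBMS can be thought of as a file manager that manages data in databases rather than files in
file systems. In IBM's mainframe operating systems, the nonrelational data managers were (and
still are, because these legacy application systems are still in use) known as access methods.

A DBMS is usually an inherent part of a database product. On PCs, Microsoft Access is a popular
example of a single-user or small-group DBMS. Microsoft SQL Server is an example of a DBMS that
serves database requests from multiple (client) users. Other popular DBMSs (all of which,
incidentally, are RDBMSs) are IBM's DB2, Oracle's line of database management products, and
Sybase's products.

IBM's Information Management System (IMS) was one of the first DBMSs. A DBMS may be used by, or
combined with, transaction managers such as IBM's Customer Information Control System (CICS).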


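5. Distributed Database


A distributed database is a database in which portions of the database are stored on multiple
computers within a network. Users have access to the portion of the database at their own
location, so they can access the data relevant to their tasks without interfering with the work
of others.


6. DDBMS


A DDBMS (distributed database management system) is a centralized application that manages a
distributed database as if it were all stored on the same computer. The DDBMS synchronizes all
the data periodically and, in cases where multiple users must access the same data, ensures that
updates and deletes performed on the data at one location are automatically reflected in the
data stored elsewhere.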


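7. Field


In a database table, a field is a data structure for a single piece of data. Fields are
organized into records, which contain all the information within the table relevant to a
specific entity. For example, in a table called customer contact information, telephone number
would likely be a field in a row that would also contain other fields such as street address and
city. The records make up the table rows and the fields make up the columns.


8. Record


In a database, a record (sometimes called a row) is a group of fields within a table that are
relevant to a specific entity. For example, in a table called customer contact information, a
row would likely contain fields such as ID number, name, street address, city, telephone number,
and so on.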


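9. Table


In a relational database, a table (sometimes called a file) organizes the information about a
single topic into rows and columns. For example, a database for a business would typically
contain a table for customer information, which would store customers' account numbers,
addresses, phone numbers, and so on as a series of columns. Each single piece of data (such as
the account number) is a field in the table. A column consists of all the entries in a single
field, such as the telephone numbers of all the customers. Fields, in turn, are organized as
records, which are complete sets of information (such as the set of information about a
particular customer), each of which comprises a row. The process of normalization determines how
data will be most effectively organized into tables.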
Text A


Data Warehouse Frequently Asked Questions 


Q1. What is a data warehouse? How does it differ from a Database 
Management System (DBMS)? 


A data warehouse is a database that provides users with data extracted from online 
transaction processing systems, batch systems, and externally syndicated data. 

By contrast, a DBMS is software that controls the data in a database. It provides data 
security, data integrity, interactive queries, interactive data-entry and updating, and data 
independence. 


Q2. How do I know if my organization needs a data warehouse? 


Ideal candidates for a data warehouse display three common characteristics: they
operate in a highly competitive industry, they have vast amounts of data, and they are
struggling with the integration of widely dispersed data. If your organization fits this profile,
it could well benefit from implementing a data warehouse.


Q3. What’s involved in designing, building, and implementing a data 
warehouse? 
Atre Group has identified 12 distinct steps in what we call our Data Warehouse
Navigator. They are:
1. Determine users’ needs
2. Determine DBMS server platform
3. Determine hardware platform
4. Information and data modeling
5. Construct metadata repository
6. Data acquisition and cleansing
7. Data transform, transport and populate
8. Determine middleware connectivity
9. Prototyping, querying and reporting
10. Data mining
11. Online analytic processing (OLAP)
12. Deployment and system management


Q4. Which of the steps takes the longest?


It may vary from organization to organization, but in most situations the activities
surrounding Extract, Transform and Load (ETL) are the most time-consuming.
There are multiple reasons for this:

1. Data is scattered in various disparate sources, and it is stored repeatedly in different 
files. Most organizations don't necessarily have decent documentation that is the authoritative 
source that says what is what. Identifying the correct sources of data to be used for a data 
warehouse is a daunting task. 

2. In most organizations, there are various versions of a so-called Meta Data repository.
Meta Data is data about data. Sorting out the information in the Meta Data repositories is very 
time consuming. 

3. Quality of data leaves a lot to be desired. Dirty data is either inaccurate data or 
inconsistent data. Once again, it is challenging to determine what is clean and what is dirty. In 
order to make this distinction, one needs to work together with business representatives who 
are knowledgeable in the business rules. The best business representatives are always busy. 
As a result, it is very difficult to get their attention. 

4. It is time consuming to identify the data needed for analysis purposes to be used in a 
data warehouse. 

5. The sub steps are: extraction, scrubbing, reconciling, aggregating, and summarizing 
the data. Each one of these sub steps is also time consuming. 

And as a result, the entire process of ETL is the longest lasting step. 
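
The sub steps named in reason 5 can be illustrated with a toy pipeline. This is a minimal sketch, assuming two hypothetical source systems with a made-up record layout (`customer`, `region`, `amount`); real ETL work involves far more rules, and the duplicate check here would also discard two genuinely identical sales.

```python
from collections import defaultdict

# Hypothetical records from two source systems that describe the same sales.
source_a = [
    {"customer": " Alice ", "region": "north", "amount": "120.50"},
    {"customer": "Bob", "region": "South", "amount": "80"},
    {"customer": "Bob", "region": "South", "amount": "80"},  # stored twice
]
source_b = [
    {"customer": "carol", "region": "NORTH", "amount": "not available"},  # dirty
    {"customer": "Dave", "region": "south", "amount": "40.25"},
]


def extract(*sources):
    """Extraction: pull the rows out of every source."""
    for source in sources:
        yield from source


def scrub(rows):
    """Scrubbing: drop rows whose amount is not a number; trim and parse the rest."""
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue
        yield {
            "customer": row["customer"].strip(),
            "region": row["region"],
            "amount": amount,
        }


def reconcile(rows):
    """Reconciling: use one spelling convention and drop exact duplicates."""
    seen = set()
    for row in rows:
        key = (row["customer"].title(), row["region"].lower(), row["amount"])
        if key not in seen:
            seen.add(key)
            yield {"customer": key[0], "region": key[1], "amount": key[2]}


def aggregate(rows):
    """Aggregating: total the amounts per region."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


totals_by_region = aggregate(reconcile(scrub(extract(source_a, source_b))))
grand_total = sum(totals_by_region.values())  # summarizing: one overall figure

print(totals_by_region)
print(grand_total)
```

Even in this tiny example, most of the code deals with inconsistent spelling, repeated storage, and dirty values rather than with the totals themselves, which is the point the reasons above make.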
Q5. What is data mining? Is it a part of a data warehouse effort?


Data mining is finding patterns in the data that are not easily detectable by intuition or 
experience. Data mining could be a part of a data warehouse effort, or it could be a separate 
activity. 

A major difference between a data warehouse and data mining is that most times, one 
uses summaries while using a data warehouse; whereas for data mining, detailed level data is 
needed. The patterns usually get lost when the data is summarized. 
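
A small example may help show why patterns get lost in summaries. The sketch below uses invented purchase data (the `baskets` list is hypothetical): the summary counts how often each item sold, while the detail-level data reveals which items are bought together.

```python
from collections import Counter
from itertools import combinations

# Hypothetical detail-level data: the items in each purchase.
baskets = [
    {"printer", "ink"},
    {"printer", "ink", "paper"},
    {"laptop", "mouse"},
    {"printer", "ink"},
    {"paper"},
]

# Summary view: how many times each item sold.
item_totals = Counter(item for basket in baskets for item in basket)

# Detail view: how often two items appear in the same purchase.
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)
most_common_pair, times_together = pair_counts.most_common(1)[0]

print(item_totals)
print(most_common_pair, times_together)
```

The item totals alone cannot show that printers and ink are bought together; that pattern is visible only in the individual purchases.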

In a number of organizations, data mining is considered a part of the data warehouse 
effort. 


Q6. What is OLAP? 


OLAP stands for Online Analytical Processing and is a technique for processing large
amounts of data for the purposes of business analysis. The fundamental goal of OLAP is to
greatly reduce the time it takes to query or read business data. It fundamentally differs
from operational processing, commonly referred to as OLTP (On-Line Transaction
Processing), which is built to achieve better write performance. OLAP servers process data
summaries to predetermine results of "What If" analysis. Normally, OLAP servers extract data
from the data warehouse and then summarize and organize the data into multidimensional 
structures, commonly known as Cubes. The multidimensional data structures (or cubes) make 
it simple and efficient for users to formulate complex queries, arrange data on a report, switch 
from summary to detail data and filter or slice data into meaningful subsets. 
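
The idea of a cube can be sketched with plain dictionaries. This is an illustration only, not how an OLAP server is implemented: the fact rows, the three dimensions (year, region, product), and the helper names `build_cube`, `roll_up`, and `slice_cube` are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, region, product, sales).
facts = [
    (2023, "north", "disk", 100),
    (2023, "north", "tape", 40),
    (2023, "south", "disk", 70),
    (2024, "north", "disk", 130),
    (2024, "south", "tape", 60),
]
DIMENSIONS = ("year", "region", "product")


def build_cube(rows):
    """Precompute the total for every combination of dimension values."""
    cube = defaultdict(int)
    for *coords, sales in rows:
        cube[tuple(coords)] += sales
    return dict(cube)


def roll_up(cube, keep):
    """Summarize the cube over every dimension not listed in keep."""
    idx = [DIMENSIONS.index(name) for name in keep]
    out = defaultdict(int)
    for coords, sales in cube.items():
        out[tuple(coords[i] for i in idx)] += sales
    return dict(out)


def slice_cube(cube, dimension, value):
    """Keep only the cells where one dimension has a fixed value."""
    i = DIMENSIONS.index(dimension)
    return {coords: sales for coords, sales in cube.items() if coords[i] == value}


cube = build_cube(facts)
sales_by_year = roll_up(cube, ["year"])
north_by_product = roll_up(slice_cube(cube, "region", "north"), ["product"])

print(sales_by_year)
print(north_by_product)
```

Rolling up moves from detail to summary, and slicing filters the cube to a meaningful subset, which are two of the operations the paragraph above describes.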


Q7. What is a data warehouse appliance?


A data warehouse appliance is a combined hardware and software product that is
designed specifically for analytical processing. An appliance allows the purchaser to deploy a 
high-performance data warehouse right out of the box. 

In a traditional data warehouse implementation, the database administrator (DBA) can 
spend a significant amount of time tuning and putting structures around the data to get the 
database to perform well. With a data warehouse appliance, however, it is the vendor who is 
responsible for simplifying the physical database design layer and making sure that the 
software is tuned for the hardware. 

When a traditional data warehouse needs to be scaled out, the administrator needs to
migrate all the data to a larger, more robust server. When a data warehouse appliance needs to
be scaled out, the appliance can simply be expanded by purchasing additional plug-and-play
components.

A data warehouse appliance comes with its own operating system, storage, database 
management system (DBMS) and software. It uses massively parallel processing (MPP) and 
distributes data across integrated disk storage, allowing independent processors to query data 
in parallel with little contention and redundant components to fail gracefully without harming 
the entire platform. Data warehouse appliances use Open Database Connectivity (ODBC), 
Java Database Connectivity (JDBC), and OLE DB interfaces to integrate with other 
extract-transform-load (ETL) tools and business intelligence (BI) or business analytics (BA)
applications. 
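
The principle behind massively parallel processing, distributing the data and letting each processor query its own share, can be sketched as follows. This is a simplified model under stated assumptions: the order rows are invented, the four "nodes" are just lists, and threads from Python's `concurrent.futures` module stand in for the independent processors of a real appliance.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical order rows; a real appliance would hold these on separate disks.
rows = [{"order_id": i, "amount": i * 10} for i in range(1, 101)]
NUM_NODES = 4

# Distribute the rows across nodes by a simple rule: order_id modulo node count.
partitions = [[] for _ in range(NUM_NODES)]
for row in rows:
    partitions[row["order_id"] % NUM_NODES].append(row)


def local_sum(partition):
    """The query each node runs on its own share of the data."""
    return sum(r["amount"] for r in partition if r["amount"] > 500)


# Every node answers the query for its own partition at the same time.
with ThreadPoolExecutor(max_workers=NUM_NODES) as pool:
    partial_sums = list(pool.map(local_sum, partitions))

# A final step combines the partial answers into one result.
total = sum(partial_sums)
print(partial_sums, total)
```

Because no node needs another node's rows to compute its partial sum, the nodes do not contend with one another, and the combined result equals what a single machine scanning all the rows would produce.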

Currently, smaller data warehouse appliance vendors seem to be concentrating on adding 
functionality, such as in-memory analytics, to their products in order to compete with the 
mega-vendors. It is anticipated, however, that all appliance vendors will be impacted by the 
trend toward inexpensive, high-performance, scalable virtualized data warehouse 
implementations that use regular hardware and open source software. 


New Words

online ['ɔnlain] adj. online, connected
syndicate ['sindikit] n. syndicate, association of businesses
candidate ['kændidit] n. candidate
vast [vɑ:st] adj. huge, very large
dispersed [dis'pə:st] adj. scattered, spread out
construct [kən'strʌkt] vt. to build, to construct, to create
populate ['pɔpjuleit] v. to populate, to fill (with data)
surrounding [sə'raundiŋ] n. surroundings, environment
adj. surrounding
scattered ['skætəd] adj. scattered, dispersed
disparate ['dispərit] adj. different, dissimilar
repeatedly [ri'pi:tidli] adv. repeatedly, again and again
decent ['di:snt] adj. fairly good, respectable
authoritative [ɔ:'θɔritətiv] adj. authoritative, commanding
daunting ['dɔ:ntiŋ] adj. daunting, discouraging
inaccurate [in'ækjurit] adj. wrong, not accurate
representative [ˌrepri'zentətiv] n. representative
scrub [skrʌb] v. to cleanse, to scrub
reconcile ['rekənsail] vt. to reconcile, to bring into agreement
pattern ['pætən] n. pattern, model
detectable [di'tektəbl] adj. detectable, discernible
intuition [ˌintju'iʃən] n. intuition
exponential [ˌekspəu'nenʃəl] n. exponent
adj. exponential
predetermine [ˌpri:di'tə:min] v. to determine in advance
multidimensional [ˌmʌltidi'menʃənl] adj. multidimensional
cube [kju:b] n. cube
formulate ['fɔ:mjuleit] vt. to express as a formula, to state systematically
arrange [ə'reindʒ] v. to arrange, to order
filter ['filtə] n. filter
vt. to filter
combination [ˌkɔmbi'neiʃən] n. combination, union
deploy [di'plɔi] v. to deploy, to roll out
significant [sig'nifikənt] adj. meaningful, major, important
migrate [mai'greit] vi. to migrate, to move
robust [rə'bʌst] adj. robust, sturdy
harm [hɑ:m] v. to harm, to damage
n. harm, damage
functionality [ˌfʌŋkʃə'næliti] n. functionality
anticipate [æn'tisipeit] vt. to anticipate, to expect
scalable ['skeiləbl] adj. scalable, able to be upgraded

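Phrases

Frequently Asked Question (FAQ): a commonly asked question
differ from: to be unlike
provide ... with: to supply with, to equip with
data independence: separation of applications from the way data is stored
competitive industry: an industry with strong competition
data modeling: building a model of data
online analytic processing (OLAP): interactive analysis of large amounts of data
Meta Data: data about data
dirty data: inaccurate or inconsistent data
business rule: a rule that governs how a business operates
multidimensional data structure: a data structure organized along several dimensions
formulate complex queries: to state complex queries systematically
right out of the box: ready to use as delivered
be scaled out: to be expanded with additional capacity
plug-and-play component: a component that works as soon as it is connected
massively parallel processing (MPP): processing spread across many processors at once
Open Database Connectivity (ODBC): a standard interface for connecting to databases
Java Database Connectivity (JDBC): the Java interface for connecting to databases
business intelligence (BI): tools and practices for analyzing business information
business analytics (BA): analysis of business data
concentrate on: to focus on
in-memory analytics: analysis performed on data held in main memory
virtualized data warehouse: a data warehouse running on virtualized resources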


Abbreviations

OLTP (On Line Transaction Processing)  on-line transaction processing system
OLE (Object Linking and Embedding)  linking and embedding of objects


Notes


[1] Most organizations don’t necessarily have decent documentation that is the authoritative
source that says what is what.
In this sentence, "that is the authoritative source that says what is what" is an attributive
clause modifying "decent documentation". Within that clause, "that says what is what" is also
an attributive clause, modifying "the authoritative source", and "what is what" is an object
clause serving as the object of "says".

[2] Identifying the correct sources of data to be used for a data warehouse is a daunting task.
In this sentence, "Identifying the correct sources of data to be used for a data warehouse" is
a gerund phrase serving as the subject. "to be used for a data warehouse" is an infinitive
phrase used attributively, modifying "the correct sources of data".

[3] In order to make this distinction, one needs to work together with business representatives
who are knowledgeable in the business rules.
In this sentence, "In order to make this distinction" is an adverbial of purpose modifying the
predicate "needs". "who are knowledgeable in the business rules" is an attributive clause
modifying "business representatives".

[4] A major difference between a data warehouse and data mining is that most times, one uses
summaries while using a data warehouse; whereas for data mining, detailed level data is
needed.
In this sentence, "that most times, one uses summaries while using a data warehouse; whereas
for data mining, detailed level data is needed" is a predicative clause. Within it, "while
using a data warehouse" is an adverbial of time modifying "uses"; "whereas" expresses contrast.

[5] It fundamentally differs from operational processing, commonly referred to as OLTP (On
Line Transaction Processing), which is built to achieve better write performance.
In this sentence, "It" refers to OLAP. "commonly referred to as OLTP (On Line Transaction
Processing)" is a past participle phrase that gives further information about "operational
processing". "which is built to achieve better write performance" is a non-restrictive
attributive clause that gives further information about OLTP.

[6] With a data warehouse appliance, however, it is the vendor who is responsible for
simplifying the physical database design layer and making sure that the software is tuned
for the hardware.
In this sentence, "it is the vendor who is responsible for simplifying the physical database
design layer and making sure that the software is tuned for the hardware" is an emphatic
sentence pattern introduced by "it". It emphasizes the subject, "the vendor".
In English, the pattern "It is/was + emphasized element + that/who clause" can also emphasize
an adverbial. For example:
It was last week that he bought that new computer.
It was in 1970 that E. F. Codd at IBM invented the relational database.
Note: to emphasize the predicate, use do or did. For example:
Mike did send his manager an email yesterday, saying that he had fixed the printer.
Please do come earlier next time.


Exercises


【Ex.1】 Answer the following questions according to the text.

1. What is a data warehouse?
2. What common characteristics do ideal candidates for a data warehouse have?
3. What is the third step in the so-called Data Warehouse Navigator?
4. Which of the steps takes the longest in most situations?
5. What does one need to do in order to determine what data is clean and what is dirty?
6. What is data mining?
7. What is the major difference between a data warehouse and data mining?
8. What does OLAP stand for? What is it?
9. What is a warehouse appliance?
10. With a data warehouse appliance, what is the vendor responsible for?

【Ex.2】 Write the English word that matches each meaning and part of speech. The first letter
of each word is given.

vt. to express as a formula  f
adj. [gloss illegible in source]  r
v. [partly illegible in source], to configure  d
adj. spread apart, not gathered together  s
n. processing of jobs in batches  b
n. computer that provides services to clients  s
vi. to change, to convert  t
adj. not in agreement, contradictory  i
n. style, model  p
n. connected (on-line) operation  o
n. knowledge base, storehouse  r
adj. [gloss illegible in source]  d
adj. wrong, not exact  i
adj. basic, essential  f
adj. [gloss illegible in source]  m
n. index, power (in mathematics)  e
adj. strong, sturdy  r
n. the range of functions provided  f
adj. able to be upgraded or enlarged  s
n. execution, carrying out  i


【Ex.3】 Translate the following sentences into Chinese.

1. You can chat to other people who are online. 

2. Each summer a new batch of students tries to find work. 

3. Are your products and services competitive? How about marketing? 

4. My server is having problems this morning. 

5. It is proverbially easier to destroy than to construct. 

6. The photochemical reactions transform the light into electrical impulses. 

7. Experimental results show that the algorithm is robust against normal and geometrical attacks.

8. The feedback that comes from disparate industries and different areas has different results.

9. The other fundamental consideration in the conception of a plan is function. 

10. Populations tend to grow at an exponential rate.

【Ex.4】 Fill each blank with the appropriate word from the list below (use each word only once).

gather    compile    interactive    operations    product

rapidly    models    organization    figures    specified

A decision support system (DSS) is a computer-based information system that supports
business or organizational decision-making activities. DSSs serve the management, __(1)__,
and planning levels of an __(2)__ and help to make decisions, which may be __(3)__ changing
and not easily __(4)__ in advance.

DSSs include knowledge-based systems. A properly designed DSS is an __(5)__ software-based
system intended to help decision makers __(6)__ useful information from a
combination of raw data, documents, and personal knowledge, or business __(7)__ to
identify and solve problems and make decisions.

Typical information that a decision support application might __(8)__ and present
includes: inventories of information assets (including legacy and relational data sources,
cubes, data warehouses, and data marts), comparative sales __(9)__ between one period and
the next, projected revenue figures based on __(10)__ sales assumptions.


Text B 


Data Backup 


In information technology, a backup, or the process of backing up, refers to the copying 
and archiving of computer data so it may be used to restore the original after a data loss event. 
The verb form is to back up in two words, whereas the noun is backup. 

Backups have two distinct purposes. The primary purpose is to recover data after its loss, 
be it by data deletion or corruption. The secondary purpose of backups is to recover data from 
an earlier time, according to a user-defined data retention policy, typically configured within a 
backup application for how long copies of data are required. Though backups represent a 
simple form of disaster recovery, and should be part of any disaster recovery plan, backups by 
themselves should not be considered a complete disaster recovery plan. One reason for this is 
that not all backup systems are able to reconstitute a computer system or other complex 
configuration such as a computer cluster, active directory server, or database server by simply 
restoring data from a backup. 

Since a backup system contains at least one copy of all data considered worth saving, the 
data storage requirements can be significant. Organizing this storage space and managing the 
backup process can be a complicated undertaking. A data repository model may be used to
provide structure to the storage. Nowadays, there are many different types of data storage
devices that are useful for making backups. There are also many different ways in which these 
devices can be arranged to provide geographic redundancy, data security, and portability. 

Before data are sent to their storage locations, they are selected, extracted, and 
manipulated. Many different techniques have been developed to optimize the backup 
procedure. These include optimization for dealing with open files and live data sources as 
well as compression, encryption, and deduplication, among others. Every backup scheme 
should include dry runs that validate the reliability of the data being backed up. It is important 
to recognize the limitations and human factors involved in any backup scheme. 


1. Selection and extraction of data 


A successful backup job starts with selecting and extracting coherent units of data. Most 
data on modern computer systems is stored in discrete units, known as files. These files are 
organized into file systems. Files that are actively being updated can be thought of as “live” 
and present a challenge to back up. It is also useful to save metadata that describes the 
computer or the file system being backed up. 

Deciding what to back up at any given time is a harder process than it seems. If too much
redundant data is backed up, the data repository will fill up too quickly. Backing up an
insufficient amount of data can eventually lead to the loss of critical information. 


1.1 Files


1.1.1 Copying files 

With a file-level approach, making copies of files is the simplest and most common way to
perform a backup. A means to perform this basic function is included in all backup software 
and all operating systems. 

1.1.2 Partial file copying

Instead of copying whole files, one can limit the backup to only the blocks or bytes 
within a file that have changed in a given period of time. This technique can use substantially 
less storage space on the backup medium, but requires a high level of sophistication to 
reconstruct files in a restore situation. Some implementations require integration with the 
source file system. 
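
The idea of copying only the changed blocks can be illustrated with a minimal Python sketch. It is not how any particular backup product works: the 4096-byte block size, the use of SHA-256 digests to detect changes, and the function names are all assumptions made for illustration.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size in bytes


def block_digests(data: bytes, block_size: int = BLOCK_SIZE) -> list[str]:
    """Return one SHA-256 digest per fixed-size block of data."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def changed_blocks(old: bytes, new: bytes,
                   block_size: int = BLOCK_SIZE) -> dict[int, bytes]:
    """Map block index -> new contents for every block that differs."""
    old_digests = block_digests(old, block_size)
    changes = {}
    for index, digest in enumerate(block_digests(new, block_size)):
        if index >= len(old_digests) or digest != old_digests[index]:
            start = index * block_size
            changes[index] = new[start:start + block_size]
    return changes


def apply_blocks(old: bytes, changes: dict[int, bytes], new_length: int,
                 block_size: int = BLOCK_SIZE) -> bytes:
    """Rebuild the new file from the old copy plus the saved blocks."""
    n_blocks = (new_length + block_size - 1) // block_size
    blocks = []
    for index in range(n_blocks):
        start = index * block_size
        # Use the saved block if it changed, otherwise reuse the old one.
        blocks.append(changes.get(index, old[start:start + block_size]))
    return b"".join(blocks)[:new_length]
```

Only the dictionary returned by changed_blocks (plus the new file length) has to be written to the backup medium. The restore step, apply_blocks, shows why reconstruction is the more demanding part: it needs both the earlier full copy and every later set of changes.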

1.1.3 Deleted files 

To prevent the unintentional restoration of files that have been intentionally deleted, a 
record of the deletion must be kept.


1.2 File systems


1.2.1 File system dump 

Instead of copying files within a file system, a block-level copy of the whole file system
itself can be made. This is also known as a raw partition backup and is related to disk
imaging. The process usually involves unmounting the file system and running a program like 
dd (Unix). Because the disk is read sequentially and with large buffers, this type of backup 
can be much faster than reading every file normally, especially when the file system contains 
many small files, is highly fragmented, or is nearly full. But because this method also reads 
the free disk blocks that contain no useful data, this method can also be slower than 
conventional reading, especially when the file system is nearly empty. Some file systems 
provide a “dump” utility that reads the disk sequentially for high performance while skipping 
unused sections. The corresponding restore utility can selectively restore individual files or 
the entire volume at the operator’s choice. 

1.2.2 Identification of changes 

Some file systems have an archive bit for each file that says it was recently changed. 
Some backup software looks at the date of the file and compares it with the last backup to 
determine whether the file was changed. 
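
The date comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name changed_since is hypothetical, and the time of the last backup is assumed to be available as a Unix timestamp.

```python
import os


def changed_since(paths, last_backup_time: float):
    """Yield the paths whose modification time is later than the last backup."""
    for path in paths:
        # st_mtime is the file's last-modification time in seconds.
        if os.stat(path).st_mtime > last_backup_time:
            yield path
```

A scheme of this kind depends on the file dates being trustworthy; a file whose date is reset after it is modified would be missed.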

1.2.3 Versioning file system 

A versioning file system keeps track of all changes to a file and makes those changes 
accessible to the user. Generally this gives access to any previous version, all the way back to 
the file's creation time. An example of this is the Wayback versioning file system for Linux. 


1.3 Live data 


If a computer system is in use while it is being backed up, the possibility of files being 
open for reading or writing is real. If a file is open, the contents on disk may not correctly 
represent what the owner of the file intends. This is especially true for database files of all 
kinds. The term fuzzy backup can be used to describe a backup of live data that looks like it 
ran correctly, but does not represent the state of the data at any single point in time. This is 
because the data being backed up changed in the period of time between when the backup 
started and when it finished. For databases in particular, fuzzy backups are worthless. 

1.3.1 Snapshot backup

A snapshot is an instantaneous function of some storage systems that presents a copy of 
the file system as if it were frozen at a specific point in time, often by a copy-on-write 
mechanism. An effective way to back up live data is to temporarily quiesce them (e.g. close 
all files), take a snapshot, and then resume live operations. At this point the snapshot can be 
backed up through normal methods. While a snapshot is very handy for viewing a filesystem 
as it was at a different point in time, it is hardly an effective backup mechanism by itself.

1.3.2 Open file backup

Many backup software packages feature the ability to handle open files in backup 
operations. Some simply check for openness and try again later. File locking is useful for 
regulating access to open files. 

When attempting to understand the logistics of backing up open files, one must consider 
that the backup process could take several minutes to back up a large file such as a database. 
In order to back up a file that is in use, it is vital that the entire backup represent a 
single-moment snapshot of the file, rather than a simple copy of a read-through. 

1.3.3 Cold database backup 

During a cold backup, the database is closed or locked and not available to users. The 
data files do not change during the backup process so the database is in a consistent state 
when it is returned to normal operation. 

1.3.4 Hot database backup

Some database management systems offer a means to generate a backup image of the 
database while it is online and usable (hot). This usually includes an inconsistent image of the 
data files plus a log of changes made while the procedure is running. Upon a restore, the 
changes in the log files are reapplied to bring the copy of the database up-to-date (the point in 
time at which the initial hot backup ended). 
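
The restore step of a hot backup can be sketched as follows. This is a simplified model, not the mechanism of any real database management system: the database is assumed to be a Python dictionary, and the change log is assumed to be a list of hypothetical ("put" or "delete", key, value) entries recorded while the image was being copied.

```python
def restore_hot_backup(image: dict, change_log: list) -> dict:
    """Replay logged changes, in order, over the possibly inconsistent image."""
    database = dict(image)  # leave the backup image itself untouched
    for operation, key, value in change_log:
        if operation == "put":
            database[key] = value
        elif operation == "delete":
            database.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {operation}")
    return database


# 'b' was copied after it changed and 'a' before, so the image is inconsistent.
image = {"a": 1, "b": 20}
log = [("put", "a", 10), ("put", "b", 20), ("delete", "c", None)]
restored = restore_hot_backup(image, log)
```

Each logged operation gives the same result whether or not the image already contains it, which is why replaying the whole log over an inconsistent image still yields a consistent copy.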


2. Managing the backup process 


As long as new data are being created and changes are being made, backups will need to 
be performed at frequent intervals. Individuals and organizations with anything from one 
computer to thousands of computer systems all require protection of data. The scales may be 
very different, but the objectives and limitations are essentially the same. Those who perform 
backups need to know how successful the backups are, regardless of scale. 


2.1 Objectives 


2.1.1 Recovery point objective (RPO) 

The RPO is the point in time that the restarted infrastructure will reflect. Essentially, this is the
roll-back that will be experienced as a result of the recovery. The most desirable RPO would 
be the point just prior to the data loss event. Making a more recent recovery point achievable 
requires increasing the frequency of synchronization between the source data and the backup 
repository. 
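
The relationship between backup frequency and the roll-back experienced on recovery can be made concrete with a short sketch. The function name and the sample times below are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def rollback_on_recovery(backup_times: list, failure_time: datetime) -> timedelta:
    """Return how much work is lost: failure time minus the latest earlier backup."""
    earlier = [t for t in backup_times if t <= failure_time]
    if not earlier:
        raise ValueError("no backup exists before the failure")
    return failure_time - max(earlier)


# Backups at midnight and noon; data loss at 15:00 on the same day.
backups = [datetime(2024, 1, 1, 0), datetime(2024, 1, 1, 12)]
lost = rollback_on_recovery(backups, datetime(2024, 1, 1, 15))
```

With backups every twelve hours the roll-back can approach twelve hours in the worst case; synchronizing more often reduces that bound.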

2.1.2 Recovery time objective (RTO) 

The RTO is the amount of time elapsed between disaster and restoration of business functions.


2.2 Implementation


2.2.1 Scheduling

Using a job scheduler can greatly improve the reliability and consistency of backups by
removing part of the human element. Many backup software packages include this
functionality.

2.2.2 Authentication

Over the course of regular operations, the user accounts and/or system agents that
perform the backups need to be authenticated at some level. The power to copy all data off of
or onto a system requires unrestricted access. Using an authentication mechanism is a good
way to prevent the backup scheme from being used for unauthorized activity.

2.2.3 Chain of trust

Removable storage media are physical items and must only be handled by trusted
individuals. Establishing a chain of trusted individuals (and vendors) is critical to defining the
security of the data.

New Words

archive ['ɑ:kaiv] vt. to archive, to file away; n. archive file
recover [ri'kʌvə] vt. to regain, to restore
disaster [di'zɑ:stə] n. disaster, calamity
reconstitute [ˌri:'kɔnstitju:t] vt. to reconstitute, to set up again
redundancy [ri'dʌndənsi] n. redundancy
portability [ˌpɔ:tə'biləti] n. portability, ease of carrying
compression [kəm'preʃən] n. compression
deduplication [di:ˌdju:pli'keiʃən] n. data deduplication, removal of duplicate data
extraction [iks'trækʃən] n. extraction, taking out
discrete [dis'kri:t] adj. discontinuous, discrete
file [fail] n. file
metadata ['metədeitə] n. metadata
redundant [ri'dʌndənt] adj. superfluous, redundant
insufficient [ˌinsə'fiʃənt] adj. insufficient, not enough
reconstruct [ˌri:kən'strʌkt] v. to rebuild, to reconstruct
unintentional [ˌʌnin'tenʃənl] adj. not deliberate, unintended
buffer ['bʌfə] n. buffer
fragmented [fræg'mentid] adj. broken into fragments, fragmentary
fuzzy ['fʌzi] adj. fuzzy, blurred
worthless ['wə:θlis] adj. worthless, useless
snapshot ['snæpʃɔt] n. snapshot
instantaneous [ˌinstən'teinjəs] adj. instantaneous, immediate
inconsistent [ˌinkən'sistənt] adj. inconsistent, incompatible, contradictory
up-to-date [ˌʌptə'deit] adj. current, recent, contemporary
unrestricted [ˌʌnris'triktid] adj. unrestricted, free

Phrases

data retention  keeping of data
data repository model  model of a data store
dry run  trial run, rehearsal
disk imaging  making an image of a disk
copy-on-write  copying at the time of writing
software package  package of programs
file locking  locking of files
regardless of  irrespective of, without regard to
job scheduler  program that schedules jobs


Abbreviations

RPO (Recovery Point Objective)  target recovery point
RTO (Recovery Time Objective)  target recovery time


Exercises


【Ex.5】 Answer the following questions according to the text.

1. What does a backup refer to in information technology?
2. How many distinct purposes do backups have? What are they?
3. What is one reason that backups by themselves should not be considered a complete
disaster recovery plan?
4. What does a successful backup job start with?
5. What is the simplest and most common way to perform a backup with a file-level
approach?
6. What is also known as a raw partition backup?
7. What does a versioning file system do?
8. What can the term fuzzy backup be used to describe?
9. What is vital in order to back up a file that is in use?
10. What is a good way to prevent the backup scheme from being used for unauthorized
activity?


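Reference Translation
Data Warehouse FAQ


Q1. What is a data warehouse? How does it differ from a database management system (DBMS)?


A data warehouse is a database that provides users with data extracted from on-line
transaction processing systems, batch systems, and externally syndicated data.

By contrast, a database management system is the software that controls the data in a
database. It provides data security, data integrity, interactive query, interactive data entry
and updating, and data independence.


Q2. How do I know whether my organization needs a data warehouse?


Ideal candidates for a data warehouse show three common characteristics: they are in a
highly competitive industry, they deal with vast amounts of data, and they are trying to
integrate widely dispersed data. If your organization matches these characteristics, it can
benefit from implementing a data warehouse.


Q3. How do I design, construct, and implement a data warehouse?


The Atre Group has identified 12 distinct steps, known as the "Data Warehouse Navigator".
They are:
(1) Determine users' needs
(2) Determine the DBMS server platform
(3) Determine the hardware platform
(4) Information and data modeling
(5) Construct the metadata repository
(6) Data acquisition and cleansing
(7) Data transformation, transport, and migration
(8) Determine the middleware connectivity
(9) Prototyping, querying, and reporting
(10) Data mining
(11) On-line analytical processing (OLAP)
(12) Deployment and system management


Q4. Which step takes the longest?


It may vary from organization to organization, but in most cases extraction, transformation,
and loading (ETL) takes the most time. There are several reasons for this:

(1) Data is scattered across a variety of disparate sources and stored repeatedly in
different files. Most organizations do not necessarily have decent, authoritative
documentation that identifies exactly where data comes from. Identifying the correct sources
of data to be used for a data warehouse is a daunting task.

(2) In most organizations there are various versions of the so-called metadata repository.
Metadata is data about data. Sorting out the information in the metadata repositories is very
time-consuming.

(3) The quality of the data leaves much to be desired. Inaccurate or inconsistent data is
dirty data. In addition, determining which data is clean and which is dirty is quite a
challenge. To make this distinction, one needs to work together with business representatives
who are knowledgeable in the business rules. The best business representatives are always
busy, so getting their attention is very difficult.

(4) Identifying the data that needs to be analyzed in the data warehouse is time-consuming.

(5) The sub-steps here are extracting, scrubbing, reconciling, aggregating, and summarizing
the data. Each of them takes time.

Therefore, the ETL process as a whole lasts the longest.


Q5. What is data mining? Is it part of data warehouse work?


Data mining is the search for patterns in data that are not easily detected by intuition or
experience. Data mining can be part of data warehouse work, or it can be a separate activity.

The major difference between a data warehouse and data mining is that, in most cases,
summary data is used when working with a data warehouse, whereas data mining requires
detailed data. Patterns are usually lost when summarized data is used.

In a number of organizations, data mining is considered part of data warehouse work.


Q6. What is OLAP?


OLAP stands for On Line Analytic Processing. It is a technique for processing large amounts
of data for the purpose of business analysis. The basic goal of OLAP is to reduce
exponentially the time spent querying or reading business data. It fundamentally differs from
operational processing, commonly referred to as OLTP (On Line Transaction Processing), which
is built to achieve better write performance. An OLAP server processes summaries of the data
in order to predetermine the results of "what if" analyses. Typically, an OLAP server extracts
data from the data warehouse, then summarizes the data and organizes it into a
multidimensional structure, commonly known as a cube. A multidimensional data structure (or
cube) lets users simply and efficiently formulate complex queries, arrange report data, switch
from summary data to detailed data, and filter or divide the data into meaningful subsets.


Q7. What is a warehouse appliance?


A data warehouse appliance combines hardware and software products and is designed
specifically for analytical processing. An appliance allows the purchaser to deploy a
high-performance data warehouse right out of the box.

In a traditional data warehouse implementation, the database administrator (DBA) spends a
great deal of time tuning and setting up the data structures of the database so that it
performs well. With a data warehouse appliance, however, it is the vendor who is responsible
for simplifying the physical database design layer and making sure that the software is tuned
for the hardware.

When a traditional data warehouse needs to be scaled out, the administrator has to migrate
all the data to a larger, more robust server. When a data warehouse appliance needs to be
scaled out, it can be expanded simply by purchasing additional plug-and-play components.

A data warehouse appliance comes with its own operating system, storage, database
management system (DBMS), and software. It uses massively parallel processing (MPP) and
distributes data across integrated disk storage, so that independent processors can query the
data in parallel with little contention and the failure of a redundant component does not harm
the whole platform. Data warehouse appliances use Open Database Connectivity (ODBC), Java
Database Connectivity (JDBC), and OLE DB interfaces, and integrate with
extract-transform-load (ETL) tools and with business intelligence (BI) or business analytics
(BA) applications.

At present, smaller data warehouse appliance vendors seem to be concentrating on adding
product functionality (such as in-memory analytics) so as to compete with the large vendors.
It is anticipated, however, that all vendors will move toward the same trend: inexpensive,
high-performance, scalable virtualized data warehouses that use commodity hardware and
open-source software.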


Text A 


Data Preprocessing 


Data preprocessing is a data mining technique that involves transforming raw data into 
an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in 
certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a 
proven method of resolving such issues. Data preprocessing prepares raw data for further 
processing. Data goes through a series of steps during preprocessing. 


1. Data Cleansing 


Data cleansing, also known as data scrubbing, is the process of ensuring that a set of data 
is correct and accurate. During this process, records are checked for accuracy and consistency, 
and they are either corrected or deleted as necessary. This can occur within a single set of 
records or between multiple sets of data that need to be merged or that will work together. 

In its simplest form, data cleansing involves a person or persons reading through a
set of records and verifying their accuracy. Typos and spelling errors are corrected, 
mislabeled data is properly labeled and filed, and incomplete or missing entries are completed. 
These operations often purge out-of-date or unrecoverable records so that they do not take up 
space and cause inefficient operations. 

In more complex operations, data cleansing can be performed by computer programs. 
These programs can check the data with a variety of rules and procedures decided upon by the 
user. A program could be set to delete all records that have not been updated within the
previous five years, correct any misspelled words and delete any duplicate copies. A more
complex program might be able to fill in a missing city based on a correct postal code or 
change the prices of all items in a database to another type of currency. 
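
A rule-based cleansing program of the kind just described might look like the following Python sketch. The record layout (name, postal_code, city, updated), the lookup table, and the rule that two records with the same name and postal code are duplicates are all assumptions made for illustration.

```python
from datetime import date, timedelta

# Hypothetical lookup table; a real one would come from a postal database.
CITY_BY_POSTAL_CODE = {"10001": "New York", "94105": "San Francisco"}


def cleanse(records: list, today: date) -> list:
    """Drop stale and duplicate records and fill in missing cities."""
    cutoff = today - timedelta(days=5 * 365)  # roughly five years
    seen = set()
    cleaned = []
    for record in records:
        if record["updated"] < cutoff:
            continue  # not updated within the previous five years
        key = (record["name"], record["postal_code"])
        if key in seen:
            continue  # duplicate copy of a record already kept
        seen.add(key)
        fixed = dict(record)
        if not fixed.get("city"):
            # Fill in a missing city from a correct postal code.
            fixed["city"] = CITY_BY_POSTAL_CODE.get(fixed["postal_code"])
        cleaned.append(fixed)
    return cleaned
```

Each rule is decided by the user, as the text says: changing the cutoff, the duplicate key, or the lookup table changes what counts as clean data.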


2. Data Integration 


Data integration is the merging of multiple data sources into a single data source. This 
practice is often very time-consuming and involved, as the different data sources are likely 
incompatible with one another. Things as simple as different column names on a spreadsheet 
are enough to require data reformatting. This process is most common in situations where two
groups started with no connection, but are placed together after they have worked 
independently. Data integration has become a more important topic due to the prevalence of 
free data sources and online databases. 

The data part of data integration can be almost anything as long as it is stored in a 
computer system. The actual content of the data is rarely as important as the way in which the 
data is stored. Most of the time, the data is kept in databases, organized systems of 
information. These systems contain unique entries and fields that allow users to find 
information quickly. 

The biggest hurdle to any data integration process is the data itself. In many cases, when 
the data was first set up, there was no intention of ever merging the dataset with another. This 
means that even though two datasets may refer to the same thing, they are totally 
incompatible. 

Nearly anything will make databases incompatible. Something as simple as a difference 
in presentation, such as field order or column width, can be enough to prevent an easy merger. 
When the data is significantly different, such as one database that contains more or less 
information, the merging is much more difficult. 
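
The effect of differing column names can be shown with a small Python sketch that merges two sources into one structure. The two sources, their column names, and the mapping tables are hypothetical; real integrations usually need many more rules than this.

```python
# Hypothetical mappings from each source's column names to shared names.
SALES_COLUMNS = {"cust_id": "customer_id", "cust_name": "name"}
SUPPORT_COLUMNS = {"CustomerID": "customer_id", "Phone": "phone"}


def rename(row: dict, mapping: dict) -> dict:
    """Return a copy of row with its columns renamed to the shared names."""
    return {mapping.get(column, column): value for column, value in row.items()}


def integrate(sales: list, support: list) -> list:
    """Merge rows from both sources into one record per customer."""
    merged = {}
    for rows, mapping in ((sales, SALES_COLUMNS), (support, SUPPORT_COLUMNS)):
        for row in rows:
            row = rename(row, mapping)
            # The sources may store the key as a number or as text.
            row["customer_id"] = str(row["customer_id"])
            merged.setdefault(row["customer_id"], {}).update(row)
    return list(merged.values())
```

Even in this tiny case two kinds of incompatibility had to be resolved before merging: the column names and the data type of the key.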

The two situations that call for data integration more than any other are in the business 
and the research fields. In the business world, merging departments or companies requires 
combining the previously separate information into a single structure. This form of integration 
is generally very difficult unless the original groups used similar software and had similar 
information goals. 

When data integration is performed for research purposes, it generally goes much
more smoothly. When one researcher gives access to his information to another, the two parties are
generally looking into the same process. This means they will use similar methods to catalog 
and store their data. 

In the past, data integration was a relatively minor area of data studies, but this has 
changed since the early part of the 21st century. With free online databases becoming more
popular and accurate, companies are scrambling to get their information in a sharable format.
This allows them to both release their information in a public form and to integrate private 
versions of well-known public interfaces into their systems. 


3. Data Transformation 


Data transformation is the process of converting information or data from one format to 
another format. While the strategy is often thought of in terms of converting documents from 
one format to another, data transformations may also involve converting programs from one 
type of computer language to a different format in order to allow the program to run on a 
specific platform. The actual transformation may involve converting multiple data streams 
into a common format, or converting a single format into multiple different forms for use 
across a wide spectrum of platforms. 

The process of data transformation often involves the use of what is known as SQL, or
structured query language. SQL is the computer language that is responsible for managing the 
information that resides in some type of data management system. 

In actual use, data transformation involves the use of an executable program that is 
capable of reading the base or original language of the data, and identifying the language or 
languages that the data must translate into in order to be used by other programs. Once the 
mapping for the transformation is accomplished, the program then converts the data into the 
single or multiple formats desired, and distributes the converted data accordingly. With many 
applications, this takes place in a matter of seconds. 

A similar process is known as data mediation. Like data transformation, the idea is to 
make data in one format usable in another format. One difference with mediation is that
the data mapping process involves the creation of what is known as a data model, serving as 
an intermediary between the two formats involved, rather than the direct translation that 
occurs with the transformation of information. 
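
The difference between direct translation and mediation can be sketched in Python. Here a small data model, the hypothetical Reading class, stands between a CSV input and a JSON output; the column names sensor and value are assumptions for illustration.

```python
import csv
import io
import json
from dataclasses import dataclass, asdict


@dataclass
class Reading:
    """Intermediate data model shared by every format (hypothetical)."""
    sensor: str
    value: float


def from_csv(text: str) -> list:
    """Parse CSV text with 'sensor' and 'value' columns into the data model."""
    return [Reading(row["sensor"], float(row["value"]))
            for row in csv.DictReader(io.StringIO(text))]


def to_json(readings: list) -> str:
    """Serialize the data model as a JSON array of objects."""
    return json.dumps([asdict(reading) for reading in readings])
```

With a model in the middle, supporting a further format means writing one reader and one writer for it, rather than a separate converter for every pair of formats.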

As with many types of computer technology, data transformation is a process that is 
continually evolving as new programs help to increase the efficiency and scope of how 
information can be translated. As more programs and formats are included in this process, the 
ability to share data across many different platforms that were once totally incompatible has 
increased significantly. In a global setting where collaborators may not always make use of 
the same programs or languages as the foundation for their data systems, these continual 
improvements mean that there is significantly less time needed to manually translate and enter 
data between systems.


4. Data Reduction


Data reduction is the transformation of numerical or alphabetical digital information 
derived empirically or experimentally into a corrected, ordered, and simplified form. The 
basic concept is the reduction of multitudinous amounts of data down to the meaningful parts. 

When information is derived from instrument readings there may also be a 
transformation from analog to digital form. When the data are already in digital form the
“reduction” of the data typically involves some editing, scaling, coding, sorting, collating,
and producing tabular summaries. When the observations are discrete but the underlying 
phenomenon is continuous, smoothing and interpolation are often needed. Often the data 
reduction is undertaken in the presence of reading or measurement errors. Some idea of the 
nature of these errors is needed before the most likely value may be determined. 
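
Two of the simplest techniques of this kind are linear interpolation and a moving average. The Python sketch below illustrates them; the text does not say which methods are used in practice, so both functions are examples only.

```python
def interpolate(x0: float, y0: float, x1: float, y1: float, x: float) -> float:
    """Estimate y at x on the straight line through (x0, y0) and (x1, y1)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def moving_average(values: list, window: int = 3) -> list:
    """Smooth a series by averaging each run of `window` consecutive values."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]
```

Interpolation fills in a value between two discrete observations, while the moving average reduces the effect of reading or measurement errors at the cost of a slightly shorter series.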

An example in astronomy is the data reduction in the Kepler satellite. This satellite 
records 95-megapixel images once every six seconds, generating tens of megabytes of data 
per second, which is orders of magnitude more than the downlink bandwidth of 550 KBps.
The on-board data reduction encompasses co-adding the raw frames for thirty minutes, 
reducing the bandwidth by a factor of 300. Furthermore, interesting targets are preselected 
and only the relevant pixels are processed, which is 6% of the total. This reduced data is then 
sent to Earth where it is processed further. 
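
The figures in this example can be checked with a little arithmetic, sketched here in Python. One value is an assumption: the text does not give the number of bytes per pixel, so 2 bytes is used purely for illustration.

```python
# Figures quoted in the text.
PIXELS_PER_IMAGE = 95_000_000        # 95-megapixel images
SECONDS_PER_IMAGE = 6                # one image every six seconds
COADD_MINUTES = 30                   # raw frames are co-added for thirty minutes
SELECTED_FRACTION = 0.06             # only 6% of the pixels are kept
DOWNLINK_BYTES_PER_SECOND = 550_000  # 550 KBps

# Assumption for illustration only: the text does not give the pixel depth.
BYTES_PER_PIXEL = 2

raw_rate = PIXELS_PER_IMAGE * BYTES_PER_PIXEL / SECONDS_PER_IMAGE
frames_coadded = COADD_MINUTES * 60 // SECONDS_PER_IMAGE
after_coadding = raw_rate / frames_coadded
after_selection = after_coadding * SELECTED_FRACTION

print(f"raw rate:        {raw_rate / 1e6:.1f} MB/s")
print(f"frames co-added: {frames_coadded}")
print(f"after co-adding: {after_coadding / 1e3:.1f} KB/s")
print(f"after selection: {after_selection / 1e3:.1f} KB/s")
```

Thirty minutes of frames taken every six seconds is 300 frames, which agrees with the factor of 300 given in the text. Under the assumed pixel depth, the raw rate is tens of megabytes per second, and the rate after both reductions is well below the downlink bandwidth.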

Research has also been carried out on the use of data reduction in wearable (wireless) 
devices for health monitoring and diagnosis applications. For example, in the context of 
epilepsy diagnosis, data reduction has been used to increase the battery lifetime of a wearable 
EEG device by selecting, and only transmitting, EEG data that is relevant for diagnosis and 
discarding background activity. 


New Words

preprocess [pri:'prəuses] vt. to preprocess, to process in advance
transform [træns'fɔ:m] vt. to convert, to change; to change the shape of; vi. to change, to convert
understandable [ˌʌndə'stændəbl] adj. understandable, comprehensible
incomplete [ˌinkəm'pli:t] adj. incomplete, imperfect
accurate ['ækjurit] adj. correct, precise
accuracy ['ækjurəsi] n. precision, correctness
consistency [kən'sistənsi] n. consistency
verify ['verifai] vt. to check, to confirm
typo ['taipəu] n. typing error
mislabel [mis'leibl] vt. to label incorrectly
purge [pə:dʒ] n. purification, removal; v. to purify, to remove
out-of-date [ˌautəv'deit] adj. expired, outdated, behind the times
unrecoverable [ˌʌnri'kʌvərəbl] adj. unrecoverable
inefficient [ˌini'fiʃənt] adj. inefficient, of low efficiency
update [ʌp'deit] v. to update, to revise; n. update; modernization; updated information
misspell [mis'spel] vt. to spell wrongly
integration [ˌinti'greiʃən] n. integration, combination
time-consuming ['taimkənˌsju:miŋ] adj. time-consuming
involved [in'vɔlvd] adj. complicated; concerned
incompatible [ˌinkəm'pætəbl] adj. incompatible, contradictory, irreconcilable
reformat [ri:'fɔ:mæt] vt. to reformat
connection [kə'nekʃən] n. connection, relationship
independently [ˌindi'pendəntli] adv. independently
prevalence ['prevələns] n. prevalence, widespread occurrence
unique [ju:'ni:k] adj. unique, distinctive
hurdle ['hə:dl] n. obstacle
sharable ['ʃeərəbl] adj. able to be shared
transformation [ˌtrænsfə'meiʃən] n. change, transformation
convert [kən'və:t] vt. to change, to convert
strategy ['strætidʒi] n. strategy
spectrum ['spektrəm] n. spectrum; range
executable ['eksikju:təbl] adj. executable, able to be carried out
mapping ['mæpiŋ] n. mapping
accomplish [ə'kɔmpliʃ] vt. to complete, to achieve, to realize
distribute [dis'tribju(:)t] vt. to distribute, to hand out, to spread
continually [kən'tinjuəli] adv. continually, frequently
collaborator [kə'læbəreitə] n. collaborator, co-worker
alphabetical [ˌælfə'betikəl] adj. alphabetical
experimentally [iksˌperi'mentəli] adv. experimentally, by experiment
simplify ['simplifai] vt. to simplify
instrument ['instrumənt] n. instrument, tool, device
collate [kɔ'leit] v. to collate
observation [ˌɔbzə'veiʃən] n. observation; observational data (or report)
phenomenon [fi'nɔminən] n. phenomenon
smooth [smu:ð] adj. smooth, steady, fluent; vt. to make smooth; vi. to become smooth
interpolation [inˌtə:pə'leiʃən] n. interpolation
astronomy [ə'strɔnəmi] n. astronomy
satellite ['sætəlait] n. satellite
megapixel ['megəpiksəl] n. megapixel (one million pixels)
megabyte ['megəbait] n. megabyte
magnitude ['mægnitju:d] n. size, magnitude, order of magnitude
downlink ['daunliŋk] n. downlink
encompass [in'kʌmpəs] v. to encompass, to include
factor ['fæktə] n. factor, coefficient
target ['tɑ:git] n. target
preselect [ˌpri:si'lekt] vt. to select in advance
pixel ['piksəl] n. pixel
wireless ['waiəlis] adj. wireless
diagnosis [ˌdaiəg'nəusis] n. diagnosis
discard [dis'kɑ:d] vt. to discard, to throw away

Phrases

data preprocessing  preparatory processing of data
data mining  discovery of patterns in data
raw data  unprocessed data
lacking in  short of, without
data cleansing  cleaning of data
spelling error  mistake in spelling
take up  to occupy
duplicate copy  copy, replica
postal code  postcode
data integration  combining of data
as long as  provided that; while
set up  to establish
look into  to investigate
data transformation  conversion of data
data stream  flow of data
be responsible for ...  to be in charge of ...
data management system  system for managing data
be capable of  to be able to
translate into  to convert into, to translate into
a matter of  approximately, about
data mediation  adjustment of data between formats
data reduction  shrinking of data
be derived from  to originate from

Abbreviations
KBps (Kilobytes per second)
EEG (electroencephalogram)

Notes


[1] A program could be set to delete all records that have not been updated within the
previous five years, correct any misspelled words and delete any duplicate copies.

In this sentence, "to delete all records that have not been updated within the previous five
years, correct any misspelled words and delete any duplicate copies" is an infinitive phrase
serving as an adverbial of purpose. "that have not been updated within the previous five years"
is an attributive clause that modifies and restricts "all records".

[2] This process is most common in situations where two groups started with no connection,
but are placed together after they have worked independently.

In this sentence, "where two groups started with no connection, but are placed together after
they have worked independently" is an attributive clause that modifies and restricts "situations".

[3] This means that even though two datasets may refer to the same thing, they are totally
incompatible.

In this sentence, "that even though two datasets may refer to the same thing, they are totally
incompatible" is an object clause. Within that clause, "even though two datasets may refer to the
same thing" is an adverbial clause of concession modifying the predicate "are totally
incompatible". "even though" means "although" or "despite the fact that".

[4] While the strategy is often thought of in terms of converting documents from one format
to another, data transformations may also involve converting programs from one type of
computer language to a different format in order to allow the program to run on a specific
platform.

In this sentence, "While the strategy is often thought of in terms of converting documents from
one format to another" is an adverbial clause of concession. "in order to allow the program to
run on a specific platform" serves as an adverbial of purpose.

[5] In actual use, data transformation involves the use of an executable program that is
capable of reading the base or original language of the data, and identifying the language
or languages that the data must translate into in order to be used by other programs.

In this sentence, "in order to be used by other programs" serves as an adverbial of purpose,
modifying the predicate of the main clause, "involves". "that is capable of reading the base or
original language of the data, and identifying the language or languages that the data must
translate into" is an attributive clause that modifies and restricts "an executable program".
Within that clause, "and" joins the two objects of "is capable of". "that the data must translate
into" is an attributive clause that modifies and restricts "the language or languages".


Exercises


[Ex. 1] Answer the following questions according to the text.

1. What is data preprocessing? 

2. What is data cleansing? 

3. What does data cleansing involve in its simplest form?

4. What is data integration? 

5. What are the two situations that call for data integration more than any other? 

6. What is data transformation? 

7. What is SQL? 

8. What is the one difference between data transformation and data mediation? 

9. What is data reduction? 

10. When the data are already in digital form, what does the “reduction” of the data typically
involve?


[Ex. 2] Translate the following sentences into Chinese.

1. Effective solutions involve optimized techniques and technologies to extract, filter, and 
transform data. 

2. The data on the replica may be inconsistent with the protected data so a consistency check 
is required. 

3. This provides consistency and reduces the number of errors. 

4. The class contains methods to insert, delete, and update a row or rows from the database.

5. Examine the output to verify that all the commands were processed successfully.

6. The backup device reported an unrecoverable hardware error. 

7. Update statistics is run only on logged databases. 

8. Reformat each record in the export file so that it can be used to modify each user account. 

9. The report must include a dataset that specifies a connection to the package.

10. Setting up a sharable, independent, neutral information model is the precondition of
information integration.


[Ex. 3] Translate the following passage into Chinese.

Data reduction is a term that applies to the business practice of accumulating, analyzing 
and ultimately transforming massive amounts of data into a series of summarized reports. The 
idea behind the data reduction process is to provide a complete though somewhat simplified 
format that can be utilized with relative ease in business settings. Several different approaches 
to the process may be used, with the selection of data reduction techniques and systems 
depending on the nature of the data and how those summary reports need to be structured in 
order to provide a full and comprehensive representation of that data. 

One of the primary tasks in any type of data reduction effort is the organization of all 
data collected for the purpose. At times this portion of the process focuses on establishing 
some sort of order to the data that involves prioritizing in a consistent manner, using 
well-defined criteria to aid in the activity. Depending on the type of data involved, it is not 
unusual to include some rounding of certain figures in order to make the information easier to 
work with during the summarizing. Finally, the arrangement of the data into tables, columnar 
reports, or other types of labeling or formatting may be necessary in order to allow recipients 
of the reports to follow the logistics of the simplified information with relative ease. 


[Ex. 4] Fill in the blanks with the words given below (use each word only once).


sorting reduction images processed monitoring
nature increase diagnosis transformation encompasses


Data Reduction 


Data reduction is the transformation of numerical or alphabetical digital information
derived empirically or experimentally into a corrected, ordered, and simplified form. The
basic concept is the __(1)__ of multitudinous amounts of data down to the meaningful
parts.

When information is derived from instrument readings there may also be a __(2)__ from
analog to digital form. When the data are already in digital form the “reduction” of the data
typically involves some editing, scaling, coding, __(3)__, collating, and producing tabular
summaries. When the observations are discrete but the underlying phenomenon is continuous
then smoothing and interpolation are often needed. Often the data reduction is undertaken in
the presence of reading or measurement errors. Some idea of the __(4)__ of these errors is
needed before the most likely value may be determined.

An example in astronomy is the data reduction in the Kepler satellite. This satellite
records 95-megapixel __(5)__ once every six seconds, generating tens of megabytes of data
per second, which is orders of magnitude more than the downlink bandwidth of 550 KBps.
The on-board data reduction __(6)__ co-adding the raw frames for thirty minutes, reducing
the bandwidth by a factor of 300. Furthermore, interesting targets are pre-selected and only
the relevant pixels are processed, which is 6% of the total. This reduced data is then sent to
Earth where it is __(7)__ further.

Research has also been carried out on the use of data reduction in wearable (wireless)
devices for health __(8)__ and diagnosis applications. For example, in the context of
epilepsy diagnosis, data reduction has been used to __(9)__ the battery lifetime of a wearable
EEG device by selecting, and only transmitting, EEG data that is relevant for __(10)__ and
discarding background activity.


Text B 


Data Cleansing 


Data cleansing or data cleaning is the process of detecting and correcting (or removing) 
corrupt or inaccurate records from a record set, table, or database. It refers to identifying 
incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, 
or deleting the dirty or coarse data. Data cleansing may be performed interactively with data 
wrangling tools, or as batch processing through scripting. 

After cleansing, a data set should be consistent with other similar data sets in the system. 
The inconsistencies detected or removed may have been originally caused by user entry errors, 
by corruption in transmission or storage, or by different data dictionary definitions of similar 
entities in different stores. Data cleansing differs from data validation in that validation almost 
invariably means data is rejected from the system at entry and is performed at the time of 
entry, rather than on batches of data. 

The actual process of data cleansing may involve removing typographical errors or 
validating and correcting values against a known list of entities. The validation may be strict 
(such as rejecting any address that does not have a valid postal code) or fuzzy (such as 
correcting records that partially match existing, known records). Some data cleansing
solutions will clean data by cross-checking with a validated data set. A common data
cleansing practice is data enhancement, where data is made more complete by adding related
information, for example, appending to each address any phone numbers related to that
address. Data cleansing may also involve activities like harmonization of data and
standardization of data. An example of harmonization is the expansion of short codes
(st, rd, etc.) to actual words (street, road, etc.). Standardization of data is a means of
changing a reference data set to a new standard, for example, the use of standard codes.
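
The harmonization of short codes described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lookup table `SHORT_CODES` and the function name `harmonize_address` are invented for this example, and a real address cleaner would need a far larger table and context rules (for instance, "St." can also mean "Saint").

```python
import re

# Hypothetical lookup table mapping short codes to full words.
SHORT_CODES = {"st": "street", "rd": "road", "ave": "avenue"}

def harmonize_address(address: str) -> str:
    """Expand known short codes, with or without a trailing period."""
    def expand(match: re.Match) -> str:
        word = match.group(1)
        full = SHORT_CODES.get(word.lower())
        if full is None:
            return match.group(0)  # not a known short code: leave unchanged
        # Keep the capitalization style of the original token.
        return full.capitalize() if word[0].isupper() else full

    # Match whole alphabetic words, optionally followed by a period.
    return re.sub(r"\b([A-Za-z]+)\b\.?", expand, address)

print(harmonize_address("12 Main St."))
```

Words that are not in the table, such as "Main", pass through untouched, so the function can be applied to a whole column without altering already harmonized values.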


1. Motivation 


Administratively, incorrect or inconsistent data can lead to false conclusions and 
misdirected investments on both public and private scales. For instance, the government may 
want to analyze population census figures to decide which regions require further spending 
and investment on infrastructure and services. In this case, it will be important to have access 
to reliable data to avoid erroneous fiscal decisions. In the business world, incorrect data can 
be costly. Many companies use customer information databases that record data like contact 
information, addresses, and preferences. For instance, if the addresses are inconsistent, the 
company will suffer the cost of resending mail or even losing customers. The profession of
forensic accounting and fraud investigating uses data cleansing in preparing its data; this is
typically done before data is sent to a data warehouse for further investigation. There are
packages available so you can cleanse/wash address data while you enter it into your system.
This is normally done via an application programming interface (API).


2. Data Quality 


High-quality data needs to pass a set of quality criteria. Those include:


2.1 Validity


The degree to which the measures conform to defined business rules or constraints. 
When modern database technology is used to design data-capture systems, validity is fairly 
easy to ensure: invalid data arises mainly in legacy contexts (where constraints were not 
implemented in software) or where inappropriate data-capture technology was used (e.g., 
spreadsheets, where it is very hard to limit what a user chooses to enter into a cell, if cell 
validation is not used). Data constraints fall into the following categories: 

* Data-type constraints: values in a particular column must be of a particular data
type, e.g., Boolean, numeric (integer or real), date, etc.

* Range constraints: typically, numbers or dates should fall within a certain range. That
is, they have minimum and/or maximum permissible values.

* Mandatory constraints: certain columns cannot be empty.

* Unique constraints: a field, or a combination of fields, must be unique across a
dataset. For example, no two persons can have the same social security number.

* Set-membership constraints: the values for a column come from a set of discrete
values or codes. For example, a person's gender may be Female, Male or Unknown
(not recorded).

* Foreign-key constraints: this is the more general case of set membership. The set of
values in a column is defined in a column of another table that contains unique values.
For example, in a US taxpayer database, the "state" column is required to belong to one
of the US's defined states or territories: the set of permissible states/territories is
recorded in a separate States table. The term foreign key is borrowed from relational
database terminology.

* Regular expression patterns: occasionally, text fields will have to be validated this
way. For example, phone numbers may be required to have the pattern (999)
999-9999.

* Cross-field validation: certain conditions that utilize multiple fields must hold. For
example, in laboratory medicine, the sum of the components of the differential white
blood cell count must be equal to 100 (since they are all percentages). In a hospital
database, a patient's date of discharge from hospital cannot be earlier than the date of
admission.
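
Several of the constraint categories listed above can be expressed as simple checks. The following is a minimal sketch, assuming a hypothetical patient record stored as a Python dictionary; the field names (`name`, `age`, `gender`, `phone`, `admitted`, `discharged`) and the age limits are assumptions made for illustration only.

```python
import re
from datetime import date

GENDERS = {"Female", "Male", "Unknown"}                 # set-membership constraint
PHONE_PATTERN = re.compile(r"\(\d{3}\) \d{3}-\d{4}")    # pattern (999) 999-9999

def validate_record(rec: dict) -> list[str]:
    """Return a list of constraint violations for one (hypothetical) record."""
    errors = []
    # Mandatory constraint: the name column cannot be empty.
    if not rec.get("name"):
        errors.append("name is mandatory")
    # Data-type and range constraints: age must be an integer from 0 to 130.
    age = rec.get("age")
    if not isinstance(age, int) or not 0 <= age <= 130:
        errors.append("age must be an integer between 0 and 130")
    # Set-membership constraint.
    if rec.get("gender") not in GENDERS:
        errors.append("gender is not in the permitted set")
    # Regular expression pattern.
    if not PHONE_PATTERN.fullmatch(rec.get("phone") or ""):
        errors.append("phone does not match (999) 999-9999")
    # Cross-field validation: discharge cannot be earlier than admission.
    if rec["discharged"] < rec["admitted"]:
        errors.append("discharge date is earlier than admission date")
    return errors

def find_duplicate_keys(records: list[dict], key: str) -> set:
    """Unique constraint: report key values that occur more than once."""
    seen, dupes = set(), set()
    for rec in records:
        value = rec[key]
        (dupes if value in seen else seen).add(value)
    return dupes
```

Returning a list of violations, rather than raising on the first one, lets an auditing step report every problem in a record at once.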


2.2 Accuracy 


The degree of conformity of a measure to a standard or a true value. Accuracy is very 
hard to achieve through data-cleansing in the general case, because it requires accessing an 
external source of data that contains the true value: such "gold standard" data is often
unavailable. Accuracy has been achieved in some cleansing contexts, notably customer 
contact data, by using external databases that match up zip codes to geographical locations 
(city and state), and also help verify that street addresses within these zip codes actually exist. 


2.3 Completeness 


The degree to which all required measures are known. Incompleteness is almost 
impossible to fix with data cleansing methodology: one cannot infer facts that were not 
captured when the data in question was initially recorded. In some contexts, e.g., interview
data, it may be possible to fix incompleteness by going back to the original source of data, i.e.,
re-interviewing the subject, but even this does not guarantee success because of problems of
recall: e.g., in an interview to gather data on food consumption, no one is likely to
remember exactly what they ate six months ago. In the case of systems that insist certain
columns should not be empty, one may work around the problem by designating a value that
indicates "unknown" or "missing", but supplying default values does not imply that the data
has been made complete.


2.4 Consistency 


The degree to which a set of measures are equivalent across systems. Inconsistency
occurs when two data items in the data set contradict each other: e.g., a customer is recorded 
in two different systems with two different current addresses, and only one of them can be 
correct. Fixing inconsistency is not always possible: it requires a variety of strategies — e.g., 
deciding which data were recorded more recently, which data source is likely to be most 
reliable (the latter knowledge may be specific to a given organization), or simply trying to 
find the truth by testing both data items (e.g., calling up the customer). 
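
One of the strategies mentioned above, preferring the value that was recorded most recently, might look like the following sketch. The record layout (an `address` field plus an `updated` date) is an assumption for this example, and recency is only a heuristic: the newest entry is not guaranteed to be the correct one.

```python
from datetime import date

def resolve_by_recency(candidates: list[dict]) -> str:
    """Pick the address from the most recently updated candidate record."""
    newest = max(candidates, key=lambda rec: rec["updated"])
    return newest["address"]

conflicting = [
    {"address": "1 Old Road", "updated": date(2019, 5, 1)},
    {"address": "2 New Street", "updated": date(2021, 3, 9)},
]
print(resolve_by_recency(conflicting))
```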


2.5 Uniformity 


The degree to which a set of data measures are specified using the same units of measure in
all systems. In datasets pooled from different locales, weight may be recorded either in 
pounds or kilos, and must be converted to a single measure using an arithmetic 
transformation. 
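
The arithmetic transformation for the weight example is a single multiplication. A small sketch follows; the function name `to_kilograms` and the accepted unit spellings are assumptions for illustration.

```python
LB_TO_KG = 0.45359237  # one international pound in kilograms (exact by definition)

def to_kilograms(value: float, unit: str) -> float:
    """Convert a weight recorded in pounds or kilos to kilograms."""
    unit = unit.strip().lower()
    if unit in ("kg", "kilo", "kilos"):
        return value
    if unit in ("lb", "lbs", "pound", "pounds"):
        return value * LB_TO_KG
    # Refuse to guess: an unknown unit is itself a data-quality problem.
    raise ValueError(f"unknown weight unit: {unit!r}")
```

Raising an error for an unrecognized unit, instead of silently passing the value through, keeps mixed units from slipping into the pooled dataset.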

The term integrity encompasses accuracy, consistency and some aspects of validation but 
is rarely used by itself in data-cleansing contexts because it is insufficiently specific. For 
example, "referential integrity" is a term used to refer to the enforcement of foreign-key
constraints above. 


3. Process 


3.1 Data auditing 


The data is audited using statistical and database methods to detect anomalies and 
contradictions: this eventually gives an indication of the characteristics of the anomalies and 
their locations. Several commercial software packages will let you specify constraints of 
various kinds (using a grammar that conforms to that of a standard programming language, 
e.g., JavaScript or Visual Basic) and then generate code that checks the data for violation of 
these constraints. For users who lack access to high-end cleansing software, microcomputer
database packages such as Microsoft Access or FileMaker Pro will also let you perform such
checks, on a constraint-by-constraint basis, interactively with little or no programming 
required in many cases.


3.2 Workflow specification


The detection and removal of anomalies is performed by a sequence of operations on the 
data known as the workflow. It is specified after the process of auditing the data and is crucial 
in achieving the end product of high-quality data. In order to achieve a proper workflow, the 
causes of the anomalies and errors in the data have to be closely considered. 


3.3 Workflow execution 


In this stage, the workflow is executed after its specification is complete and its 
correctness is verified. The implementation of the workflow should be efficient, even on large 
sets of data, which inevitably poses a trade-off because the execution of a data-cleansing 
operation can be computationally expensive. 


3.4 Post-processing and controlling 


After executing the cleansing workflow, the results are inspected to verify correctness. 
Data that could not be corrected during execution of the workflow is manually corrected, if 
possible. The result is a new cycle in the data-cleansing process where the data is audited 
again to allow the specification of an additional workflow to further cleanse the data by 
automatic processing. 

Good quality source data has to do with “Data Quality Culture” and must be initiated at 
the top of the organization. It is not just a matter of implementing strong validation checks on 
input screens because almost no matter how strong these checks are, they can often still be 
circumvented by the users. There is a nine-step guide for organizations that wish to improve 
data quality: 

* Declare a high level commitment to a data quality culture 

* Drive process reengineering at the executive level 

* Improve the data entry environment 

* Improve application integration 

* Change how processes work 

* Promote end-to-end team awareness 

* Promote interdepartmental cooperation 

* Publicly celebrate data quality excellence

* Continuously measure and improve data quality 


3.5 Parsing 


A parser decides whether a string of data is acceptable within the allowed data 
specification. This is similar to the way a parser works with grammars and languages. 
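
As a small illustration of the idea, the sketch below accepts or rejects a string against one assumed data specification, a calendar date written as year-month-day. The function name and the chosen format are assumptions for this example; the check relies on the standard-library `datetime.strptime`, which raises `ValueError` for input it cannot parse.

```python
from datetime import datetime

def is_acceptable_date(text: str) -> bool:
    """Return True if the string parses as a valid year-month-day date."""
    try:
        datetime.strptime(text, "%Y-%m-%d")
        return True
    except ValueError:
        # Wrong layout, or an impossible date such as February 30.
        return False
```

Because the parser understands the calendar, it rejects strings that merely look right, which a plain pattern match on digits would accept.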


3.6 Data transformation


Data transformation allows the mapping of the data from its given format into the format 
expected by the appropriate application. This includes value conversions or translation 
functions, as well as normalizing numeric values to conform to minimum and maximum 
values. 
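
One common reading of "normalizing numeric values to conform to minimum and maximum values" is min-max scaling, which maps each value linearly into a target range. The sketch below assumes that interpretation; the function name and default range are choices made for this example.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if lo == hi:
        # All values are identical; avoid dividing by zero.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]
```

For example, the values 10, 20 and 30 would be mapped to the start, middle and end of the target range.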


3.7 Duplicate elimination 


Duplicate detection requires an algorithm for determining whether data contains 
duplicate representations of the same entity. Usually, data is sorted by a key that would bring 
duplicate entries closer together for faster identification. 
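
The sort-then-compare approach can be sketched as follows. The matching key used here (lowercased name with collapsed whitespace, plus postal code) is an assumption for illustration; choosing a good key is the hard part of real duplicate detection.

```python
def normalize_key(record: dict) -> tuple:
    """Hypothetical matching key: tidied lowercase name plus postal code."""
    name = " ".join(record["name"].lower().split())
    return (name, record["postal_code"])

def eliminate_duplicates(records: list[dict]) -> list[dict]:
    """Sort by key so duplicates become neighbors, then keep the first of each run."""
    unique = []
    previous_key = None
    for rec in sorted(records, key=normalize_key):
        key = normalize_key(rec)
        if key != previous_key:
            unique.append(rec)
        previous_key = key
    return unique
```

After sorting, each record only has to be compared with its immediate predecessor, which is what makes the identification faster than comparing every pair.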


3.8 Statistical methods 


By analyzing the data using the values of mean, standard deviation, range, or clustering 
algorithms, it is possible for an expert to find values that are unexpected and thus erroneous. 
Although the correction of such data is difficult since the true value is not known, it can be 
resolved by setting the values to an average or other statistical value. Statistical methods can 
also be used to handle missing values which can be replaced by one or more plausible values, 
which are usually obtained by extensive data augmentation algorithms. 
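
A minimal sketch of both ideas, using only the standard-library `statistics` module, is given below. It flags values that lie far from the mean in units of standard deviation and fills missing values with the mean of the known ones. The function names and the threshold are assumptions for this example, and mean substitution is a much simpler stand-in for the data augmentation algorithms mentioned above.

```python
from statistics import mean, stdev

def flag_outliers(values, threshold=3.0):
    """Mark values whose distance from the mean exceeds threshold standard deviations."""
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return [False] * len(values)
    return [abs(v - mu) / sigma > threshold for v in values]

def impute_missing(values):
    """Replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in values]
```

In small samples a single extreme value inflates the standard deviation itself, so a lower threshold, or a more robust measure, may be needed before an expert reviews the flagged values.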


4. System 


The essential job of this system is to find a suitable balance between fixing dirty data and 
maintaining the data as close as possible to the original data from the source production 
system. This is a challenge for the extract, transform, load architect. The system should offer 
an architecture that can cleanse data, record quality events and measure/control quality of data 
in the data warehouse. A good start is to perform a thorough data profiling analysis that will 
help define the required complexity of the data cleansing system and also give an idea of the 
current data quality in the source systems. 


5. Quality screens 


Part of the data cleansing system is a set of diagnostic filters known as quality screens. 
Quality screens are divided into three categories: 
* Column screens. Testing the individual column, e.g. for unexpected values like NULL
values; non-numeric values that should be numeric; out of range values; etc.

* Structure screens. These are used to test for the integrity of different relationships
between columns (typically foreign/primary keys) in the same or different tables.
They are also used for testing that a group of columns is valid according to some
structural definition to which it should adhere.

* Business rule screens. The most complex of the three tests. They test to see if data,
maybe across multiple tables, follow specific business rules. An example could be
that if a customer is marked as a certain type of customer, the business rules that
define this kind of customer should be adhered to.

When a quality screen records an error, it can either stop the dataflow process, send the
faulty data somewhere other than the target system, or tag the data. The last option is
considered the best solution because the first option requires that someone manually
deal with the issue each time it occurs, and the second implies that data are missing from the
target system (integrity) and it is often unclear what should happen to these data.
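
The tagging option can be sketched as a column screen that returns an error message, plus a driver that attaches the messages to each row instead of stopping or diverting it. The names `column_screen_age`, `apply_screens` and `quality_errors`, and the age limits, are assumptions for this example.

```python
def column_screen_age(row: dict):
    """Column screen: age must be present, numeric, and within range."""
    age = row.get("age")
    if age is None:
        return "age is NULL"
    if not isinstance(age, (int, float)):
        return "age is not numeric"
    if not 0 <= age <= 130:
        return "age out of range"
    return None  # the row passes this screen

def apply_screens(rows: list[dict], screens) -> list[dict]:
    """Tag every row with the errors found; no row is dropped or diverted."""
    tagged = []
    for row in rows:
        errors = [msg for screen in screens if (msg := screen(row)) is not None]
        tagged.append({**row, "quality_errors": errors})
    return tagged
```

All rows still reach the target, so nothing goes missing, and downstream users can filter on the tag when they need only clean data.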


6. Criticism of existing tools and processes 


The main reasons cited for criticism of existing tools and processes are:

* Project costs: costs typically in the hundreds of thousands of dollars

* Time: lack of enough time to deal with large-scale data-cleansing software

* Security: concerns over sharing information, giving an application access across
systems, and effects on legacy systems


7. Error event schema 


The Error Event schema holds records of all error events thrown by the quality screens. 
It consists of an Error Event Fact table with foreign keys to three dimension tables that 
represent date (when), batch job (where) and screen (who produced error). It also holds 
information about exactly when the error occurred and the severity of the error. In addition 
there is an Error Event Detail Fact table with a foreign key to the main table that contains 
detailed information about in which table, record and field the error occurred and the error 
condition. 
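
The structure described above can be sketched as a small SQLite schema created from Python. All table and column names below are assumptions chosen to mirror the description (one fact table, three dimension tables, one detail table); the text itself does not prescribe them.

```python
import sqlite3

# Hypothetical table and column names mirroring the description above.
SCHEMA = """
CREATE TABLE date_dim   (date_key   INTEGER PRIMARY KEY, calendar_date TEXT);
CREATE TABLE batch_dim  (batch_key  INTEGER PRIMARY KEY, batch_name TEXT);
CREATE TABLE screen_dim (screen_key INTEGER PRIMARY KEY, screen_name TEXT);
CREATE TABLE error_event_fact (
    error_event_key INTEGER PRIMARY KEY,
    date_key    INTEGER REFERENCES date_dim(date_key),      -- when
    batch_key   INTEGER REFERENCES batch_dim(batch_key),    -- where
    screen_key  INTEGER REFERENCES screen_dim(screen_key),  -- who produced the error
    occurred_at TEXT,     -- exact time of the error
    severity    INTEGER
);
CREATE TABLE error_event_detail_fact (
    error_event_key INTEGER REFERENCES error_event_fact(error_event_key),
    table_name      TEXT,
    record_id       TEXT,
    field_name      TEXT,
    error_condition TEXT
);
"""

def create_error_event_schema(conn: sqlite3.Connection) -> None:
    """Create the error event tables on the given connection."""
    conn.executescript(SCHEMA)

conn = sqlite3.connect(":memory:")
create_error_event_schema(conn)
```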


8. Challenges and problems 


8.1 Error correction and loss of information 


The most challenging problem within data cleansing remains the correction of values to 
remove duplicates and invalid entries. In many cases, the available information on such 
anomalies is limited and insufficient to determine the necessary transformations or corrections,
leaving the deletion of such entries as a primary solution. The deletion of data, though, leads
to loss of information; this loss can be particularly costly if there is a large amount of deleted 
data. 


8.2 Maintenance of cleansed data 


Data cleansing is an expensive and time-consuming process. So after having performed
data cleansing and achieved a data collection free of errors, one would want to avoid
re-cleansing the data in its entirety after some values in the data collection have changed. The
process should only be repeated on values that have changed; this means that a cleansing 
lineage would need to be kept, which would require efficient data collection and management 
techniques. 
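
One simple way to keep such a lineage is to store a fingerprint of each record at the time it was cleansed and to re-cleanse only records whose fingerprint has changed. The sketch below assumes records are JSON-serializable dictionaries keyed by an identifier; the function names are invented for this example and this is only one of many possible designs.

```python
import hashlib
import json

def fingerprint(record: dict) -> str:
    """Stable hash of a record's content (keys sorted so order does not matter)."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def records_to_recleanse(records: dict, lineage: dict) -> list:
    """Return ids of records that are new or changed since they were last cleansed."""
    return [rid for rid, rec in records.items() if lineage.get(rid) != fingerprint(rec)]
```

After a cleansing run, the lineage would be refreshed with the fingerprints of the cleansed records, so the next run touches only what has changed since.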


8.3 Data cleansing in virtually integrated environments 


In virtually integrated sources like IBM's DiscoveryLink, the cleansing of data has to be 
performed every time the data is accessed, which considerably increases the response time 
and lowers efficiency. 


8.4 Data-cleansing framework 


In many cases, it will not be possible to derive a complete data-cleansing graph to guide 
the process in advance. This makes data cleansing an iterative process involving significant 
exploration and interaction, which may require a framework in the form of a collection of 
methods for error detection and elimination in addition to data auditing. This can be integrated 
with other data-processing stages like integration and maintenance. 


New Words
correct [kəˈrekt] vt. to put right, rectify
remove [rɪˈmuːv] vt. to delete, take away, move
corrupt [kəˈrʌpt] adj. damaged
table [ˈteɪbl] n. table, set of rows and columns
incorrect [ˌɪnkəˈrekt] adj. wrong, not correct
irrelevant [ɪˈreləvənt] adj. not related
replace [rɪˈpleɪs] vt. to substitute, take the place of
modify [ˈmɒdɪfaɪ] vt. to change, alter
coarse [kɔːs] adj. rough
inconsistency [ˌɪnkənˈsɪstənsi] n. lack of agreement, contradiction
invariably [ɪnˈveəriəbli] adv. without change, always
harmonization [ˌhɑːmənaɪˈzeɪʃn] n. bringing into agreement
standardization [ˌstændədaɪˈzeɪʃn] n. making conform to a standard
infrastructure [ˈɪnfrəstrʌktʃə] n. basic facilities
erroneous [ɪˈrəʊniəs] adj. wrong, not correct
investigation [ɪnˌvestɪˈɡeɪʃn] n. inquiry, study
inappropriate [ˌɪnəˈprəʊpriət] adj. not suitable, not fitting
cell [sel] n. cell (of a table or spreadsheet)
real [ˈriːəl] adj. actual, real
minimum [ˈmɪnɪməm] adj. smallest, lowest; n. minimum value
maximum [ˈmæksɪməm] n. largest amount, upper limit; adj. greatest possible
mandatory [ˈmændətəri] adj. compulsory, required
taxpayer [ˈtækspeɪə] n. a person who pays tax
borrowed [ˈbɒrəʊd] adj. taken from elsewhere
conformity [kənˈfɔːməti] n. agreement, compliance
unavailable [ˌʌnəˈveɪləbl] adj. not obtainable
contradict [ˌkɒntrəˈdɪkt] vt. to be in conflict with
reliable [rɪˈlaɪəbl] adj. dependable, trustworthy
uniformity [ˌjuːnɪˈfɔːməti] n. sameness, evenness
insufficiently [ˌɪnsəˈfɪʃntli] adv. not enough, inadequately
contradiction [ˌkɒntrəˈdɪkʃn] n. conflict, inconsistency
indication [ˌɪndɪˈkeɪʃn] n. sign, hint
violation [ˌvaɪəˈleɪʃn] n. breach, infringement
microcomputer [ˈmaɪkrəʊkəmpjuːtə] n. a small computer, personal computer
interactively [ˌɪntərˈæktɪvli] adv. in an interactive way
sequence [ˈsiːkwəns] n. order, series
correctness [kəˈrektnəs] n. the quality of being correct
inevitably [ɪnˈevɪtəbli] adv. unavoidably
pose [pəʊz] v. to create, cause, give rise to
trade-off [ˈtreɪd ɒf] n. a balance between competing aims, compromise
inspect [ɪnˈspekt] vt. to examine
automatic [ˌɔːtəˈmætɪk] adj. working by itself; n. an automatic machine
interdepartmental [ˌɪntədiːpɑːtˈmentl] adj. between departments
parse [pɑːz] vt. to analyze, break into parts
parser [ˈpɑːzə] n. a program that parses
normalize [ˈnɔːməlaɪz] vt. to bring into a standard form or range
duplicate [ˈdjuːplɪkət] adj. repeated, identical
elimination [ɪˌlɪmɪˈneɪʃn] n. removal, exclusion
unexpected [ˌʌnɪkˈspektɪd] adj. not anticipated, surprising
plausible [ˈplɔːzəbl] adj. seemingly reasonable
augmentation [ˌɔːɡmenˈteɪʃn] n. increase, enhancement
cleanse [klenz] vt. to clean, purify
screen [skriːn] vt. to filter, check
diagnostic [ˌdaɪəɡˈnɒstɪk] adj. relating to diagnosis
follow [ˈfɒləʊ] vt. to obey, comply with
severity [sɪˈverəti] n. seriousness
maintenance [ˈmeɪntənəns] n. upkeep, preservation
iterative [ˈɪtərətɪv] adj. repeated, iterated; n. an iterative form
exploration [ˌekspləˈreɪʃn] n. investigation, probing

Phrases
data cleansing              data cleaning
record set                  a set of records
coarse data                 rough data
data wrangling              reshaping and tidying of data
data dictionary             a catalogue of definitions of data items
typographical error         a typing or printing mistake
population census           an official count of the population
forensic accounting         accounting used for legal investigation
fraud investigating         investigation of fraud
data type                   the kind of value a field may hold
maximum permissible value   the largest value allowed
social security number      the identification number used for social insurance
discrete value              a separate, non-continuous value
match up                    to bring into agreement, pair together
referential integrity       consistency of references between tables
data auditing               examination of data for problems
conform to                  to comply with, be in agreement with
in order to ...             for the purpose of ...
no matter how               however
missing value               a value that was not recorded
extensive data augmentation algorithm   an algorithm that enlarges or enriches the data extensively
primary key                 the field that uniquely identifies a record
foreign key                 a field that refers to the primary key of another table
error correction            correction of errors in data
free of                     without, clear of
management techniques       methods of management
response time               the time taken to respond


Exercises


[Ex. 5] Answer the following questions according to the text.
1. What is data cleansing?
2. How does data cleansing differ from data validation?
3. What can lead to false conclusions and misdirected investments on both public and private
scales?
4. What does high-quality data need to pass?
5. Why is accuracy very hard to achieve through data-cleansing in the general case?
6. Why is good quality source data not just a matter of implementing strong validation checks
on input screens?
7. What is the essential job of this system?
8. How many categories are quality screens divided into? What are they?
9. What are the main reasons cited for criticism of existing tools and processes?
10. What is the most challenging problem within data cleansing?


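Reference Translation


Data Preprocessing


Data preprocessing is a data mining technique that converts raw data into an understandable
format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors
or trends, and is likely to contain many errors. Data preprocessing is a proven method of
resolving such issues. It prepares raw data for further processing. Data goes through a series
of steps during preprocessing.


1. Data Cleansing


Data cleansing (also called data cleaning) is the process of making sure that a set of data is
correct and accurate. During this process, records are checked for accuracy and consistency,
and are corrected or deleted as needed. This can happen within a single set of records, or
between multiple sets of data that need to be merged or that will work together.

In its simplest form, data cleansing is one person or several people reading through a set of
records and verifying their accuracy: correcting typos and spelling errors, fixing mislabeled
data, and filing and completing incomplete or missing entries. These operations usually purge
out-of-date or unrecoverable records so that they do not take up space and make operations
inefficient.

More complex data cleansing operations can be performed by computer programs. These
programs can check the data against a variety of rules and procedures determined by the user.
A program could be set to delete all records that have not been updated within the previous
five years, correct any misspelled words, and delete any duplicate copies. A more complex
program might fill in a missing city based on a correct postal code, or change the prices of all
items in a database to another currency.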


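2. Data Integration


Data integration is the merging of multiple data sources into a single one. This operation is
often very time-consuming, because different data sources may be incompatible with one
another. Something as simple as different column names in a spreadsheet can be enough to
require reformatting. This process is most common in situations where two groups started
with no connection, but are placed together after they have worked independently. With the
prevalence of free data sources and online databases, data integration has become a more
important topic.

As long as it is stored in a computer system, anything can serve as the data in data
integration. The actual content of the data is usually less important than the way it is stored.
In most cases, the data is kept in a database within an organized information system. These
systems contain unique entries and fields that allow users to look up information quickly.

The biggest hurdle in any data integration process is the data itself. In many cases, when the
data was first set up, no thought was given to merging the data set with another one. This
means that even though two datasets may refer to the same thing, they are totally
incompatible.

Almost anything can make databases incompatible. Something as simple as a different
presentation, such as the order of fields or the width of columns, is enough to keep the two
from merging easily. When the data differs greatly (for example, one database contains more
or less information than the other), merging is much harder.

Two fields, business and research, call for data integration more than any other. In the
business world, merging departments or companies requires combining previously separate
information into a single structure. This kind of integration is usually very difficult unless the
original organizations used similar software and had similar information goals.

When data integration is carried out for research purposes, it usually goes more smoothly.
When one researcher provides information to another, both parties are usually looking into
the same process. This means that they will use similar methods to sort and store their data.

Data integration used to be a relatively small area of data research, but this has changed
since the early 2000s. As free online databases become more popular and more accurate,
companies are striving to get their information into a sharable format. This allows them to
publish their information in a public form and to integrate private versions of some
well-known public interfaces into their own systems.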


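3. Data Transformation


Data transformation is the process of converting information or data from one format to another.
Although the usual strategy is to convert a document from one format to another, data
transformation can also convert a program written in one computer language into another language
so that it can run on a particular platform. The actual conversion may turn multiple data streams
into a common format, or turn a single format into several different forms for use on a wide
range of platforms.

The transformation process involves the use of SQL, or Structured Query Language. SQL is a
computer language responsible for managing information stored in some type of data management
system.

In practice, data transformation uses an executable program that can read the underlying data or
source language, recognize it, and convert it into data that other programs can use. Once the
mapping for the conversion is complete, the program converts the data into the required format or
formats and publishes the converted data accordingly. In many applications this takes only a few
seconds.

A similar process is called data mediation. As with data transformation, the idea is to make data
in one format usable in another. Unlike data transformation, however, data mediation involves
creating a so-called data model that acts as an intermediary between the two formats instead of
converting the information directly.

Like many kinds of computer technology, data transformation keeps evolving as new programs
improve the efficiency and widen the scope of conversion. As more programs and formats are
covered, data that was once completely incompatible can be shared across many platforms. In a
global setting, where collaborators do not always build their data systems on the same programs
or languages, these continuing improvements mean less time spent converting and entering data by
hand between systems.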


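4. Data Reduction


Data reduction is the transformation of numerical or alphanumeric information derived empirically
or experimentally into a corrected, ordered, and simplified form. The basic concept is to reduce
a large amount of data to its meaningful parts.

When the information comes from instrument readings, there may also be a conversion from analog
to digital form. When the data is already in digital form, "reducing" it typically involves some
editing, scaling, coding, sorting, collating, and producing tabular summaries. If the
observations are discrete but the underlying phenomenon is continuous, smoothing and
interpolation are often needed. Data reduction is often carried out in the presence of reading or
measurement errors. The nature of these errors needs to be considered before the most likely
value can be determined.

An example from astronomy is data reduction on the Kepler satellite. The satellite records a
95-megapixel image every six seconds, generating dozens of megabytes of data per second, which is
orders of magnitude more than the downlink bandwidth of 550 KBps. On-board data reduction co-adds
the raw frames over thirty minutes, reducing the bandwidth by a factor of 300. In addition,
targets of interest are preselected and only the relevant pixels, about 6% of the total, are
processed. The reduced data is then sent to Earth for further processing.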
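The factor of 300 in the Kepler example follows from the numbers given: thirty minutes of frames at one frame every six seconds is 300 frames summed into one. The sketch below works through that arithmetic and shows co-adding and pixel selection on toy data. The function names and the tiny four-pixel frames are made up for illustration, and the combined factor assumes a co-added frame takes the same space as a single raw frame.

```python
def coadd(frames):
    """Sum equally sized frames pixel by pixel into a single frame."""
    return [sum(pixel_values) for pixel_values in zip(*frames)]

def select(frame, indices):
    """Keep only the pixels at the preselected target positions."""
    return [frame[i] for i in indices]

# One frame every 6 seconds for 30 minutes.
frames_per_window = (30 * 60) // 6            # 300 frames become 1

# Keeping 6% of the pixels divides the volume again by 100/6.
overall_factor = frames_per_window * 100 / 6  # combined reduction factor

# Toy data: three raw frames of four pixels each, two pixels of interest.
raw_frames = [[1, 2, 3, 4], [1, 0, 2, 5], [3, 1, 1, 1]]
summed = coadd(raw_frames)
reduced = select(summed, [0, 3])
```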

Data reduction is also used in health monitoring and diagnosis applications on wearable
(wireless) devices. For example, in epilepsy diagnosis, data reduction extends the battery life
of a wearable EEG device by selecting and transmitting only the diagnostically relevant EEG data
and discarding background activity.


Text A


Data Mining 


Data mining is a powerful new technology with great potential to help companies focus 
on the most important information in the data they have collected about the behavior of their 
customers and potential customers. It discovers information within the data that queries and 
reports can't effectively reveal. 


1. What is Data Mining? 


Data mining, or knowledge discovery, is the computer-assisted process of digging 
through and analyzing enormous sets of data and then extracting the meaning of the data. 
Data mining tools predict behaviors and future trends, allowing businesses to make proactive, 
knowledge-driven decisions. Data mining tools can answer business questions that 
traditionally were too time-consuming to resolve. They scour databases for hidden patterns, 
finding predictive information that experts may miss because it lies outside their expectations. 

Data mining derives its name from the similarities between searching for valuable 
information in a large database and mining a mountain for a vein of valuable ore. Both 
processes require either sifting through an immense amount of material, or intelligently 
probing it to find where the value resides.


2. What Can Data Mining Do?


Although data mining is still in its infancy, companies in a wide range of industries,
including retail, finance, health care, manufacturing, transportation, and aerospace, are
already using data mining tools and techniques to take advantage of historical data. By using
pattern recognition technologies and statistical and mathematical techniques to sift through 
warehoused information, data mining helps analysts recognize significant facts, relationships, 
trends, patterns, exceptions and anomalies that might otherwise go unnoticed. 

For businesses, data mining is used to discover patterns and relationships in the data in 
order to help make better business decisions. Data mining can help spot sales trends, develop 
smarter marketing campaigns, and accurately predict customer loyalty. Specific uses of data 
mining include: 

* Market segmentation: Identify the common characteristics of customers who buy the
  same products from your company.
* Customer churn: Predict which customers are likely to leave your company and go to
  a competitor.
* Fraud detection: Identify which transactions are most likely to be fraudulent.
* Direct marketing: Identify which prospects should be included in a mailing list to
  obtain the highest response rate.
* Interactive marketing: Predict what each individual accessing a Web site is most
  likely interested in seeing.
* Market basket analysis: Understand what products or services are commonly
  purchased together; e.g., beer and diapers.
* Trend analysis: Reveal the difference between a typical customer this month and last.

Data mining technology can generate new business opportunities by: 

Automated prediction of trends and behaviors: Data mining automates the process of 
finding predictive information in a large database. Questions that traditionally required 
extensive hands-on analysis can now be directly answered from the data. A typical example of 
a predictive problem is targeted marketing. Data mining uses data on past promotional 
mailings to identify the targets most likely to maximize return on investment in future 
mailings. Other predictive problems include forecasting bankruptcy and other forms of default, 
and identifying segments of a population likely to respond similarly to given events. 

Automated discovery of previously unknown patterns: Data mining tools sweep through 
databases and identify previously hidden patterns. An example of pattern discovery is the 
analysis of retail sales data to identify seemingly unrelated products that are often purchased 
together. Other pattern discovery problems include detecting fraudulent credit card 
transactions and identifying anomalous data that could represent data entry keying errors.

Using massively parallel computers, companies dig through volumes of data to discover
patterns about their customers and products. For example, grocery chains have found that 
when men go to a supermarket to buy diapers, they sometimes walk out with a six-pack of 
beer as well. Using that information, it’s possible to lay out a store so that these items are
closer together.
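As a rough illustration of how such a pattern might be found, the sketch below counts how often each pair of items appears in the same basket. The transactions are invented for the example, and counting pairs is only the simplest form of market basket analysis, not the method any particular retailer uses.

```python
from collections import Counter
from itertools import combinations

def pair_counts(transactions):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in transactions:
        # sorted(set(...)) removes repeats and gives each pair one canonical order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts

# Hypothetical point-of-sale data: one list of items per customer visit.
transactions = [
    ["diapers", "beer", "milk"],
    ["diapers", "beer"],
    ["milk", "bread"],
    ["diapers", "bread", "beer"],
]
counts = pair_counts(transactions)
top_pair, top_count = counts.most_common(1)[0]
```

With this data the most frequent pair is beer and diapers, which appear together in three of the four baskets.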

AT&T, A.C. Nielsen, and American Express are among the growing ranks of
companies implementing data mining techniques for sales and marketing. These systems 
are crunching through terabytes of point-of-sale data to aid analysts in understanding 
consumer behavior and promotional strategies. Why? To gain a competitive advantage 
and increase profitability! 

Similarly, financial analysts are plowing through vast sets of financial records, data feeds, 
and other information sources in order to make investment decisions. Health-care 
organizations are examining medical records to understand trends of the past so they can 
reduce costs in the future. 


3. How Does Data Mining Work?


How is data mining able to tell you important things that you didn’t know or what is 
going to happen next? The technique that is used to perform these feats is called modeling. 
Modeling is simply the act of building a model (a set of examples or a mathematical 
relationship) based on data from situations where the answer is known and then applying the 
model to other situations where the answers aren’t known. Modeling techniques have been 
around for centuries, of course, but it is only recently that the data storage and communication
capabilities required to collect and store huge amounts of data, and the computational power
to automate modeling techniques to work directly on the data, have become available.
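The definition of modeling above (build a model where the answer is known, then apply it where the answer is not known) can be sketched with a deliberately simple model. Everything here is an assumption made for illustration: the customer fields, the age band, and the choice of a per-segment rate as the "model". Real data mining tools would typically use richer techniques such as neural networks.

```python
from collections import defaultdict

def segment(customer):
    """Reduce a customer record to a coarse segment (hypothetical grouping)."""
    age_band = "34-42" if 34 <= customer["age"] <= 42 else "other"
    return (customer["sex"], customer["married"], age_band)

def build_model(known_customers):
    """Learn, per segment, the share of customers who are heavy users."""
    totals = defaultdict(int)
    heavy = defaultdict(int)
    for c in known_customers:
        seg = segment(c)
        totals[seg] += 1
        heavy[seg] += int(c["heavy_user"])
    return {seg: heavy[seg] / totals[seg] for seg in totals}

def score(model, prospect, default=0.0):
    """Apply the model to a prospect whose usage is not yet known."""
    return model.get(segment(prospect), default)

# Situations where the answer is known (invented records).
known = [
    {"age": 36, "sex": "F", "married": False, "heavy_user": True},
    {"age": 40, "sex": "F", "married": False, "heavy_user": True},
    {"age": 38, "sex": "F", "married": False, "heavy_user": False},
    {"age": 55, "sex": "M", "married": True, "heavy_user": False},
]
model = build_model(known)

# A situation where the answer is not known.
prospect_score = score(model, {"age": 35, "sex": "F", "married": False})
```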

As a simple example of building a model, consider the director of marketing for a 
telecommunications company. He would like to focus his marketing and sales efforts on 
segments of the population most likely to become big users of long distance services. He 
knows a lot about his customers, but it is impossible to discern the common characteristics of 
his best customers because there are so many variables. From his existing database of 
customers, which contains information such as age, sex, credit history, income, zip code, 
occupation, etc., he can use data mining tools, such as neural networks, to identify the 
characteristics of those customers who make lots of long distance calls. For instance, he might 
learn that his best customers are unmarried females between the ages of 34 and 42 who make
in excess of $60,000 per year. This, then, is his model for high value customers, and he would 
budget his marketing efforts accordingly.


4. Data Mining Technologies


The analytical techniques used in data mining are often well-known mathematical 
algorithms and techniques. What is new is the application of those techniques to general 
business problems made possible by the increased availability of data and inexpensive storage 
and processing power. Also, the use of graphical interfaces has led to tools becoming 
available that business experts can easily use. 

Some of the tools used for data mining are: 

Artificial neural networks: Non-linear predictive models that learn through training and
resemble biological neural networks in structure.

Decision trees: Tree-shaped structures that represent sets of decisions. These decisions
generate rules for the classification of a dataset.

Rule induction: The extraction of useful if-then rules from data based on statistical
significance.

Genetic algorithms: Optimization techniques based on the concepts of genetic
combination, mutation, and natural selection.

Nearest neighbor: A classification technique that classifies each record based on the
records most similar to it in a historical database.
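Of the tools listed above, nearest neighbor is the easiest to show in code. The following is a minimal sketch, assuming each historical record is a pair of numeric features plus a label and that similarity is measured by Euclidean distance; the data and the name `knn_classify` are made up for the example.

```python
import math
from collections import Counter

def knn_classify(history, query, k=3):
    """Label a new record by majority vote among its k most similar past records."""
    nearest = sorted(history, key=lambda rec: math.dist(rec[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical historical database: (features, label) pairs.
history = [
    ((1.0, 1.0), "low"),
    ((1.2, 0.8), "low"),
    ((5.0, 5.0), "high"),
    ((5.5, 4.5), "high"),
    ((4.8, 5.2), "high"),
]
label_a = knn_classify(history, (5.0, 4.9))
label_b = knn_classify(history, (1.1, 1.0))
```

The choice of k and of the distance measure both affect the result; k = 3 is used here only because the toy dataset is small.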


5. Real-World Examples 


Details about who calls whom, how long they are on the phone, and whether a line is 
used for fax as well as voice can be invaluable in targeting sales of services and equipment to 
specific customers. But these tidbits are buried in masses of numbers in the database. By 
delving into its extensive customer-call database to manage its communications network, a 
regional telephone company identifies new types of unmet customer needs. Using its data 
mining system, it discovers how to pinpoint prospects for additional services by measuring 
daily household usage for selected periods. For example, households that make many lengthy 
calls between 3 p.m. and 6 p.m. are likely to include teenagers who are prime candidates for 
their own phones and lines. When the company uses target marketing that emphasizes
convenience and value for adults (“Is the phone always tied up?”), hidden demand surfaces.
Extensive telephone use between 9 a.m. and 5 p.m. characterized by patterns related to voice, 
fax, and modem usage suggests a customer has business activity. Target marketing offering 
those customers “business communications capabilities for small budgets” results in sales of 
additional lines, functions, and equipment. 

The ability to accurately gauge customer response to changes in business rules is a 
powerful competitive advantage. A bank searching for new ways to increase revenues from its
credit card operations tested a nonintuitive possibility: Would credit card usage and interest
earned increase significantly if the bank halved its minimum required payment? With 
hundreds of gigabytes of data representing two years of average credit card balances, payment 
amounts, payment timeliness, credit limit usage, and other key parameters, the bank used a 
powerful data mining system to model the impact of the proposed policy change on specific 
customer categories. The bank discovered that cutting minimum payment requirements for 
small, targeted customer categories could increase average balances and extend indebtedness 
periods, generating more than $25 million in additional interest earned.

Merck-Medco Managed Care is a mail-order business which sells drugs to the country’s largest health care
providers. Merck-Medco is mining its one terabyte data warehouse to uncover hidden links 
between illnesses and known drug treatments, and spot trends that help pinpoint which drugs 
are the most effective for what types of patients. The results are more effective treatments that 
are also less costly. Merck-Medco’s data mining project has helped customers save an 
average of 10%-15% on prescription costs. 


6. The Future of Data Mining 


In the short term, the results of data mining will be in profitable business-related areas.
Micro-marketing campaigns will explore new niches. Advertising will target potential 
customers with new precision. 

In the medium term, data mining may be as common and easy to use as e-mail. We may 
use these tools to find the best airfare to New York, root out a phone number of a long-lost 
classmate, or find the best prices on lawn mowers. 

The long-term prospects are truly exciting. Imagine intelligent agents turned loose on
medical research data or on sub-atomic particle data. Computers may reveal new treatments 
for diseases. 


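New Words

behavior [bɪˈheɪvjə] n. manner, conduct
discover [dɪsˈkʌvə] vt. to find, to find out
dig [dɪɡ] v. to dig, to excavate, to gather
proactive [ˌprəʊˈæktɪv] adj. acting in advance, taking the initiative
time-consuming [ˈtaɪmkənˌsjuːmɪŋ] adj. taking a lot of time, long drawn out
scour [ˈskaʊə] v. to search everywhere; to rinse; to polish
expectation [ˌekspekˈteɪʃən] n. anticipation, hope, prospect
similarity [ˌsɪmɪˈlærɪtɪ] n. resemblance, point of likeness
vein [veɪn] n. vein of ore; grain, texture
probe [prəʊb] v. to explore, to examine
transportation [ˌtrænspɔːˈteɪʃən] n. transport, conveyance
aerospace [ˈeərəʊspeɪs] n. aviation and spaceflight
sift [sɪft] vt. to sieve, to select carefully; to examine
    vi. to examine closely
relationship [rɪˈleɪʃənʃɪp] n. relation, connection
anomaly [əˈnɒməlɪ] n. irregularity; an abnormal person or thing
unnoticed [ʌnˈnəʊtɪst] adj. not attracting attention, overlooked
spot [spɒt] vt. to recognize, to notice
segmentation [ˌseɡmenˈteɪʃən] n. division into parts
churn [tʃɜːn] v. to be lost, to drain away (of customers)
fraudulent [ˈfrɔːdjʊlənt] adj. deceitful, deceptive
bankruptcy [ˈbæŋkrəpsɪ], [ˈbæŋkrʌptsɪ] n. insolvency
sweep [swiːp] v. to sweep, to pass over
seemingly [ˈsiːmɪŋlɪ] adv. apparently, on the surface
anomalous [əˈnɒmələs] adj. irregular, abnormal
grocery [ˈɡrəʊsərɪ] n. food shop, general store
crunch [krʌntʃ] v. to chew noisily; to tread over with a crushing sound
feat [fiːt] n. exploit, achievement; display of skill
discern [dɪˈsɜːn] v. to perceive, to distinguish, to recognize
occupation [ˌɒkjʊˈpeɪʃən] n. profession, job
budget [ˈbʌdʒɪt] n. financial plan
    v. to draw up a budget, to include in a budget
inexpensive [ˌɪnɪksˈpensɪv] adj. cheap, not costly
artificial [ˌɑːtɪˈfɪʃəl] adj. man-made
non-linear [nɒnˈlɪnɪə] adj. not linear
induction [ɪnˈdʌkʃən] n. generalization from particular cases
optimization [ˌɒptɪmaɪˈzeɪʃən] n. making something the best possible
mutation [mjuːˈteɪʃən] n. change, alteration; (biology) mutation of a species
invaluable [ɪnˈvæljʊəbl] adj. priceless, of value beyond measure
tidbit [ˈtɪdbɪt] n. a small morsel (of food); a choice item
bury [ˈberɪ] vt. to cover up, to hide
unmet [ʌnˈmet] adj. unsatisfied; not encountered; not dealt with
pinpoint [ˈpɪnpɔɪnt] n. precision
    adj. extremely small, precise
    v. to locate exactly
nonintuitive [ˌnɒnɪnˈtjuːɪtɪv] adj. not intuitive
possibility [ˌpɒsɪˈbɪlɪtɪ] n. likelihood
earn [ɜːn] vt. to gain, to make (money)
halve [hɑːv] vt. to divide in two, to share equally, to reduce by half
indebtedness [ɪnˈdetɪdnɪs] n. debt, obligation
mail-order [ˈmeɪlˌɔːdə] adj. sold by post
uncover [ʌnˈkʌvə] v. to reveal, to expose
drug [drʌɡ] n. medicine, narcotic
    vi. to take drugs
    vt. to dose with drugs, to poison
treatment [ˈtriːtmənt] n. handling; medical care
prescription [prɪˈskrɪpʃən] n. a doctor's written order for medicine; prescribed medicine
profitable [ˈprɒfɪtəbl] adj. yielding profit
niche [nɪtʃ] n. small ecological habitat; business opportunity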

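Phrases

knowledge discovery: the discovery of knowledge
computer-assisted process: a process aided by computer
knowledge-driven decision: a decision driven by knowledge
sift through: to screen, to filter
health care: medical and health services
pattern recognition: the recognition of patterns
business decision: an operational or commercial decision
customer churn: loss of customers
response rate: rate of response
market basket analysis: analysis of shopping baskets
business opportunity: a commercial opportunity
targeted marketing: marketing aimed at a target market
seemingly unrelated product: a product that appears to be unrelated
point-of-sale data: data from sales terminals
investment decision: a decision about investment
neural network: a network of artificial neurons
graphical interface: a graphics-based interface
predictive model: a model used for prediction
decision tree: a tree of decisions
rule induction: the induction of rules
genetic algorithm: an algorithm based on genetics
nearest neighbor: the nearest-neighbor algorithm
delve into: to study intensively, to research in depth
root out: to search for
lawn mower: a grass-cutting machine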
Notes


[1] Data mining is a powerful new technology with great potential to help companies focus on
the most important information in the data they have collected about the behavior of their
customers and potential customers.
In this sentence, "to help companies focus on the most important information in the data they
have collected about the behavior of their customers and potential customers" is an infinitive
phrase used as an attributive, modifying "potential". Within that infinitive phrase, "they have
collected about the behavior of their customers and potential customers" is an attributive
clause modifying "data". "focus on" means "to concentrate on, to pay attention to".

[2] An example of pattern discovery is the analysis of retail sales data to identify seemingly
unrelated products that are often purchased together.
In this sentence, "the analysis of retail sales data to identify seemingly unrelated products
that are often purchased together" is a noun phrase serving as the predicative. Within that noun
phrase, "to identify seemingly unrelated products that are often purchased together" is an
infinitive phrase used as an attributive, modifying "retail sales data". Within the infinitive
phrase, "that are often purchased together" is an attributive clause modifying "products".

[3] Modeling is simply the act of building a model (a set of examples or a mathematical
relationship) based on data from situations where the answer is known and then applying
the model to other situations where the answers aren’t known.
In this sentence, "where the answer is known" is an attributive clause modifying "situations".
"where the answers aren’t known" is also an attributive clause, modifying "other situations".
"based on" means "on the basis of, according to"; "apply ... to" means "to put ... to use on".

[4] From his existing database of customers, which contains information such as age, sex,
credit history, income, zip code, occupation, etc., he can use data mining tools, such as
neural networks, to identify the characteristics of those customers who make lots of long
distance calls.
In this sentence, "which contains information such as age, sex, credit history, income, zip code,
occupation, etc." is a non-restrictive attributive clause giving additional information about
"his existing database of customers". "to identify the characteristics of those customers who
make lots of long distance calls" is an infinitive phrase serving as an adverbial of purpose,
modifying the main-clause predicate "use". Within the infinitive phrase, "who make lots of long
distance calls" is an attributive clause modifying "those customers".


Exercises


[Ex.1] Answer the following questions according to the text.
1. What is data mining?
2. Where does data mining derive its name from?
3. What does data mining help analysts do? And how?
4. What are the specific uses of data mining mentioned in the passage?
5. What is a typical example of a predictive problem? What is an example of pattern
discovery?
6. Why are financial analysts plowing through vast sets of financial records, data feeds,
and other information sources? Why are health-care organizations examining medical
records?
7. What is modeling?
8. What are some of the tools used for data mining?
9. What is the result of Merck-Medco’s data mining project?
10. What is the future of data mining?


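[Ex.2] Write the English word that matches each given meaning and part of speech. The first
letter of each word is given.
v. to dig, to excavate, to gather  d
n. anticipation, hope, prospect  e
n. transport, conveyance  t
n. change, alteration  m
n. generalization from particular cases  i
n. a step-by-step procedure for computation  a
n. small ecological habitat; business opportunity  n
n. precision  p
adj. not in a straight-line relationship  n
n. making something the best possible  o
adj. [definition illegible in source]  c
v. to be lost, to drain away  c
v. [definition illegible in source]  s
n. relation, connection  r
v. to explore, to examine  p
v. to find, to find out  d
vt. to recognize, to notice  s
n. handling; medical care  t
n. likelihood  p
n. drawing out, taking out  [first letter missing in source]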

[Ex.3] Translate the following sentences into Chinese.

1. They discovered how to form the image in a thin layer on the surface. 

2. They will probe deeply into the matter. 

3. Campuses are usually accessible by public transportation. 

4. Segmentation of the market allows the bank to tailor its approach to the customers’ 
requirement. 

5. What modeling program are you using (include version number)? 

6. Students may pursue research in any aspect of computational linguistics. 

7. He wanted to look for occupation suited to his abilities. 

8. There has been an underspend in the department's budget. 

9. As to sequential pattern mining, the mining algorithm is very important.

10. The dataset must have a table before a relationship can be added. 


[Ex.4] Fill each blank with the appropriate word from the list (use each word only once).


CRM (customer relationship management) is an information industry term for
methodologies, software, and usually Internet capabilities that help an enterprise manage
customer relationships in an organized way. For example, an enterprise might _____(1) a
database about its customers that described relationships in _____(2) detail so that
management, salespeople, people providing service, and perhaps the customer could directly
access information, _____(3) customer needs with product plans and offerings, remind
customers of service _____(4), know what other products a customer had purchased, and so
forth.

According to one industry view, CRM consists of:

* Helping an enterprise to enable its _____(5) departments to identify and target their
  best customers, manage marketing _____(6) and generate quality leads for the sales
  team.
* Assisting the organization to _____(7) telesales, account, and sales management by
  optimizing information shared by multiple employees, and streamlining existing
  processes (for example, taking orders using mobile devices).
* Allowing the formation of individualized relationships with customers, with the aim of
  improving customer satisfaction and _____(8) profits; identifying the most profitable
  customers and providing them the highest level of service.
* Providing _____(9) with the information and processes necessary to know their customers,
  understand and identify customer needs and effectively build _____(10) between the
  company, its customer base, and distribution partners.


Text B 


Top 6 Data Mining Algorithms 


1. C4.5 


What does it do? C4.5 constructs a classifier in the form of a decision tree. In order to do 
this, C4.5 is given a set of data representing things that are already classified. 

Wait, what's a classifier? A classifier is a tool in data mining that takes a bunch of data 
representing things we want to classify and attempts to predict which class the new data 
belongs to. 

What's an example of this? Sure, suppose a dataset contains a bunch of patients. We 
know various things about each patient like age, pulse, blood pressure, family history, etc. 
These are called attributes. 

Now: 

Given these attributes, we want to predict whether the patient will get cancer. The patient 
can fall into 1 of 2 classes: will get cancer or won't get cancer. C4.5 is told the class for each 
patient. 

And here's the deal: 

Using a set of patient attributes and the patient's corresponding class, C4.5 constructs a 
decision tree that can predict the class for new patients based on their attributes. 

Cool, so what's a decision tree? Decision tree learning creates something similar to a 
flowchart to classify new data. Using the same patient example, one particular path in the 
flowchart could be:

(1) Patient has a history of cancer

(2) Patient is expressing a gene highly correlated with cancer patients 

(3) Patient has tumors 

(4) Patient's tumor size is greater than 5cm 

The bottom line is: 

At each point in the flowchart is a question about the value of some attribute, and
depending on those values, the patient gets classified.

Is this supervised or unsupervised? This is supervised learning, since the training dataset 
is labeled with classes. Using the patient example, C4.5 doesn't learn on its own that a patient 
will get cancer or won't get cancer. We told it first, it generated a decision tree, and now it 
uses the decision tree to classify. 

You might be wondering how C4.5 is different from other decision tree systems:

(1) First, C4.5 uses information gain when generating the decision tree. 

(2) Second, although other systems also incorporate pruning, C4.5 uses a single-pass 
pruning process to mitigate over-fitting. Pruning results in many improvements. 

(3) Third, C4.5 can work with both continuous and discrete data. My understanding is it 
does this by specifying ranges or thresholds for continuous data thus turning continuous data 
into discrete data. 

(4) Finally, incomplete data is dealt with in its own ways. 
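Points (1) and (3) above can be made concrete with a short calculation. The sketch below computes entropy and information gain for a yes/no split, and picks a threshold for a continuous attribute by trying the midpoints between sorted values. It is an illustration of the idea rather than C4.5 itself; the tumor sizes, labels, and function names are invented.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, test):
    """Entropy removed by splitting the rows with a yes/no test."""
    yes = [lab for row, lab in zip(rows, labels) if test(row)]
    no = [lab for row, lab in zip(rows, labels) if not test(row)]
    if not yes or not no:
        return 0.0  # the test does not actually split the data
    weighted = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(labels)
    return entropy(labels) - weighted

def best_threshold(values, labels):
    """Turn a continuous attribute into a yes/no question: value > threshold?"""
    candidates = sorted(set(values))
    midpoints = [(a + b) / 2 for a, b in zip(candidates, candidates[1:])]
    return max(midpoints, key=lambda t: information_gain(values, labels, lambda v: v > t))

# Hypothetical training data: tumor size in cm and the known class.
sizes = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
classes = ["no", "no", "no", "yes", "yes", "yes"]
threshold = best_threshold(sizes, classes)
gain = information_gain(sizes, classes, lambda v: v > threshold)
```

On this toy data the best split is "size > 4.5", which separates the two classes completely.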

Why use C4.5? Arguably, the best selling point of decision trees is their ease of 
interpretation and explanation. They are also quite fast, quite popular and the output is human 
readable. 

Where is it used? A popular open-source Java implementation can be found over
at OpenTox. Orange, an open-source data visualization and analysis tool for data mining,
implements C4.5 in its decision tree classifier.

Classifiers are great, but make sure to check out the next algorithm about clustering...


2. k-means 


What does it do? k-means creates k groups from a set of objects so that the members of a 
group are more similar. It's a popular cluster analysis technique for exploring a dataset. 

Hang on, what's cluster analysis? Cluster analysis is a family of algorithms designed to 
form groups such that the group members are more similar versus non-group members. 
Clusters and groups are synonymous in the world of cluster analysis. 

Is there an example of this? Definitely, suppose we have a dataset of patients. In cluster 
analysis, these would be called observations. We know various things about each patient like 
age, pulse, blood pressure, etc. This is a vector representing the patient.

Look:

You can basically think of a vector as a list of numbers we know about the patient. This 
list can also be interpreted as coordinates in multi-dimensional space. Pulse can be one 
dimension, blood pressure another dimension and so forth. 

You might be wondering: 

Given this set of vectors, how do we cluster together patients that have similar age, pulse, 
blood pressure, etc? 

Want to know the best part? 

You tell k-means how many clusters you want. k-means takes care of the rest. 

How does k-means take care of the rest? k-means has lots of variations to optimize for 
certain types of data. 

At a high level, they all do something like this:

(1) k-means picks points in multi-dimensional space to represent each of the k clusters. 
These are called centroids. 

(2) Every patient will be closest to 1 of these k centroids. They hopefully won't all be 
closest to the same one, so they'll form a cluster around their nearest centroid. 

(3) What we have are k clusters, and each patient is now a member of a cluster. 

(4) k-means then finds the center for each of the k clusters based on its cluster members 
(yep, using the patient vectors!). 

(5) This center becomes the new centroid for the cluster. 

(6) Since the centroid is in a different place now, patients might now be closer to other 
centroids. In other words, they may change cluster membership. 

(7) Steps (2)-(6) are repeated until the centroids no longer change, and the cluster
memberships stabilize. This is called convergence. 
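The seven steps above map almost line for line onto code. The following is a minimal sketch in plain Python, assuming the caller supplies the starting centroids (here simply two of the data points); real implementations choose them more carefully. The patient vectors and function names are invented for the example.

```python
import math

def mean(points):
    """Center of a group of points: the average of each coordinate."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def kmeans(points, centroids, max_iter=100):
    """Repeat assign-then-recenter until the centroids stop moving."""
    for _ in range(max_iter):
        # Steps (2)-(3): each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Steps (4)-(5): the center of each cluster becomes its new centroid.
        new_centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        # Step (7): stop at convergence.
        if new_centroids == centroids:
            break
        centroids = new_centroids  # step (6): memberships may change next round
    return centroids, clusters

# Hypothetical patient vectors with two measurements each.
patients = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 9), (8.5, 9.5)]
final_centroids, final_clusters = kmeans(patients, [patients[0], patients[3]])
```

Because the result depends on the starting centroids, running the sketch with a different initial choice can give different clusters, which is the sensitivity mentioned later in the text.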

Is this supervised or unsupervised? It depends, but most would classify k-means as 
unsupervised. Other than specifying the number of clusters, k-means “learns” the clusters on 
its own without any information about which cluster an observation belongs to. k-means can 
be semi-supervised. 

Why use k-means? I don’t think many will have an issue with this: 

The key selling point of k-means is its simplicity. Its simplicity means it’s generally 
faster and more efficient than other algorithms, especially over large datasets. 

It gets better: 

k-means can be used to pre-cluster a massive dataset followed by a more expensive 
cluster analysis on the sub-clusters. k-means can also be used to rapidly “play” with k and 
explore whether there are overlooked patterns or relationships in the dataset. 

It’s not all smooth sailing:

Two key weaknesses of k-means are its sensitivity to outliers, and its sensitivity to the
initial choice of centroids. One final thing to keep in mind is k-means is designed to operate 
on continuous data—you’ll need to do some tricks to get it to work on discrete data. 

Where is it used? A ton of implementations for k-means clustering are available online:
Apache Mahout, Julia, R, SciPy, Weka, MATLAB, SAS.

If decision trees and clustering didn’t impress you, you’re going to love the next 
algorithm. 


3. Support vector machines 


What does it do? Support vector machine (SVM) learns a hyperplane to classify data into
2 classes. At a high level, SVM performs a task similar to C4.5, except that SVM doesn’t use
decision trees at all.

As it turns out... 

SVM can perform a trick to project your data into higher dimensions. Once projected 
into higher dimensions... 

...SVM figures out the best hyperplane which separates your data into the 2 classes. 

Do you have an example? Absolutely, the simplest example I found starts with a bunch 
of red and blue balls on a table. If the balls aren't too mixed together, you could take a stick 
and without moving the balls, separate them with the stick. 

You see: 

When a new ball is added on the table, by knowing which side of the stick the ball is on, 
you can predict its color. 

What do the balls, table and stick represent? The balls represent data points, and the red 
and blue color represent 2 classes. The stick represents the simplest hyperplane which is a 
line. 

And the coolest part? 

SVM figures out the function for the hyperplane. 

What if things get more complicated? Right, they frequently do. If the balls are mixed 
together, a straight stick won't work. 

Here's the work-around: 

Quickly lift up the table throwing the balls in the air. While the balls are in the air and 
thrown up in just the right way, you use a large sheet of paper to divide the balls in the air. 

You might be wondering if this is cheating: 

Nope, lifting up the table is the equivalent of mapping your data into higher dimensions. 
In this case, we go from the 2-dimensional table surface to the 3-dimensional balls in the air.

How does SVM do this? By using a kernel we have a nice way to operate in higher
dimensions. The large sheet of paper is still called a hyperplane, but it is now a function for a 
plane rather than a line. 
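The "lift the table" trick can be shown with a small example. Below, red balls sit near the center of the table and blue balls farther out, so no straight stick separates them in 2 dimensions. Adding a third coordinate equal to the squared distance from the center lets a flat sheet at a fixed height do the job. The points, the height of 1.5, and the explicit `lift` function are all assumptions for illustration; a kernel lets an SVM work in the higher-dimensional space without building these coordinates explicitly.

```python
def lift(point):
    """Map a 2-D point to 3-D by adding its squared distance from the origin."""
    x, y = point
    return (x, y, x * x + y * y)

def side(point3d, height=1.5):
    """Classify by which side of the horizontal plane z = height the point is on."""
    return "red" if point3d[2] < height else "blue"

# Hypothetical ball positions on the table (origin at the center).
red = [(0.5, 0.0), (-0.3, 0.4), (0.0, -0.6)]
blue = [(2.0, 0.0), (-1.5, 1.5), (0.0, -2.5)]

red_sides = [side(lift(p)) for p in red]
blue_sides = [side(lift(p)) for p in blue]
```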

How do balls on a table or in the air map to real-life data? A ball on a table has a location 
that we can specify using coordinates. For example, a ball could be 20cm from the left edge 
and 50cm from the bottom edge. Another way to describe the ball is as (x, y) coordinates or 
(20, 50). x and y are 2 dimensions of the ball. 

Here’s the deal: 

If we had a patient dataset, each patient could be described by various measurements like 
pulse, blood pressure, etc. Each of these measurements is a dimension. 

The bottom line is: 

SVM does its thing, maps them into a higher dimension and then finds the hyperplane to 
separate the classes. 

Margins are often associated with SVM. What are they? The margin is the distance
between the hyperplane and the 2 closest data points from each respective class. In the ball 
and table example, the distance between the stick and the closest red and blue ball is the 
margin. 

The key is: 

SVM attempts to maximize the margin, so that the hyperplane is just as far away from
the red ball as from the blue ball. In this way, it decreases the chance of misclassification.
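
The margin idea can be checked with a little arithmetic. The sketch below is only an illustration: the ball positions and the two candidate sticks are invented, and it merely compares two given sticks rather than searching for the best one the way an SVM's optimizer would. A stick is written as the line w·x + b = 0, and the distance from a point to it is |w·x + b| / ‖w‖.

```python
import math

# Hypothetical ball positions on the table.
red = [(1.0, 1.0), (2.0, 0.5)]
blue = [(4.0, 3.0), (5.0, 4.0)]


def distance_to_line(point, w, b):
    """Distance from a point to the line w[0]*x + w[1]*y + b = 0."""
    x, y = point
    return abs(w[0] * x + w[1] * y + b) / math.hypot(w[0], w[1])


def margin(w, b):
    """Distance from the stick to the closest ball of either color."""
    return min(distance_to_line(p, w, b) for p in red + blue)


stick_a = ((1.0, 0.0), -2.5)  # the vertical line x = 2.5
stick_b = ((1.0, 0.0), -3.0)  # the vertical line x = 3.0

print(margin(*stick_a))  # hugs the closest red ball
print(margin(*stick_b))  # equally far from the closest red and blue balls
```

Both sticks separate the colors, but the second one leaves more room on each side, which is the property SVM looks for.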

Where does SVM get its name from? Using the ball and table example, the hyperplane is 
equidistant from a red ball and a blue ball. These balls or data points are called support 
vectors, because they support the hyperplane. 

Is this supervised or unsupervised? This is supervised learning, since a dataset is used
to first teach the SVM about the classes. Only then is the SVM capable of classifying new 
data. 

Why use SVM? SVM along with C4.5 are generally the 2 classifiers to try first. No 
classifier will be the best in all cases due to the No Free Lunch Theorem. In addition, kernel 
selection and interpretability are some weaknesses. 

Where is it used? There are many implementations of SVM. A few of the popular ones 
are scikit-learn, MATLAB and of course libsvm. 


4. Apriori 


The Apriori algorithm learns association rules and is applied to a database containing a 
large number of transactions. 

What are association rules? Association rule learning is a data mining technique for
learning correlations and relations among variables in a database. 

What’s an example of Apriori? Let’s say we have a database full of supermarket 
transactions. You can think of a database as a giant spreadsheet where each row is a customer 
transaction and every column represents a different grocery item. 

Here’s the best part: 

By applying the Apriori algorithm, we can learn the grocery items that are purchased
together, a.k.a. association rules.

The power of this is: 

You can find those items that tend to be purchased together more frequently than other 
items—the ultimate goal being to get shoppers to buy more. Together, these items are called 
itemsets. 

For example: 

You can probably quickly see that chips + dip and chips + soda seem to frequently occur
together. These are called 2-itemsets. With a large enough dataset, it will be much harder to
“see” the relationships, especially when you're dealing with 3-itemsets or more. That's
precisely what Apriori helps with!

You might be wondering how Apriori works. Before getting into the nitty-gritty of the
algorithm, you'll need to define 3 things:

(1) The first is the size of your itemset. Do you want to see patterns for a 2-itemset, 
3-itemset, etc.? 

(2) The second is your support or the number of transactions containing the itemset 
divided by the total number of transactions. An itemset that meets the support is called a 
frequent itemset. 

(3) The third is your confidence, or the conditional probability of some item given that you
have certain other items in your itemset. A good example: given chips in your itemset, there
is a 67% confidence of also having soda in the itemset.
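
Support and confidence are just ratios of counts, so they are easy to compute by hand. The sketch below assumes a made-up list of six supermarket transactions, chosen so that the confidence of soda given chips comes out at the 67% mentioned above; the function names are illustrative, not part of any library.

```python
# Hypothetical transactions; each set is one customer's basket.
transactions = [
    {"chips", "dip", "soda"},
    {"chips", "soda"},
    {"chips", "dip"},
    {"bread", "milk"},
    {"soda", "bread"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Conditional probability of the consequent given the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


print(support({"chips", "soda"}, transactions))
print(round(confidence({"chips"}, {"soda"}, transactions), 2))
```

Chips appear in 3 of the 6 baskets and chips with soda in 2, so the confidence is 2/3.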

The basic Apriori algorithm is a 3 step approach: 

(1) Join. Scan the whole database for how frequent 1-itemsets are. 

(2) Prune. Those itemsets that satisfy the support and confidence move on to the next
round for 2-itemsets.

(3) Repeat. This is repeated for each itemset level until we reach our previously 
defined size. 
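
The three steps can be sketched in plain Python. This is a minimal illustration, not a tuned implementation: the transactions and the 0.3 support threshold are invented, and, following the usual formulation of Apriori, the sketch filters itemsets on support only, leaving confidence for the later step of turning frequent itemsets into rules.

```python
from itertools import combinations


def apriori(transactions, min_support, max_size):
    """Return {itemset: support} for every frequent itemset up to max_size items."""
    n = len(transactions)

    def keep_frequent(candidates):
        kept = {}
        for candidate in candidates:
            support = sum(1 for t in transactions if candidate <= t) / n
            if support >= min_support:
                kept[candidate] = support
        return kept

    # Round 1: scan the database for frequent 1-itemsets.
    items = {item for t in transactions for item in t}
    level = keep_frequent(frozenset([item]) for item in items)
    frequent = dict(level)
    size = 2
    while level and size <= max_size:
        # Build larger candidates from the itemsets that survived the last round.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == size}
        # Drop any candidate that has an infrequent subset.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in level for sub in combinations(c, size - 1))}
        level = keep_frequent(candidates)
        frequent.update(level)
        size += 1  # repeat for the next itemset level
    return frequent


transactions = [
    {"chips", "dip", "soda"}, {"chips", "soda"}, {"chips", "dip"},
    {"bread", "milk"}, {"soda", "bread"}, {"milk"},
]
result = apriori(transactions, min_support=0.3, max_size=3)
for itemset, value in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(value, 2))
```

With this data, chips + dip and chips + soda survive as frequent 2-itemsets, while chips + dip + soda is discarded because dip + soda is not frequent.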

Is this supervised or unsupervised? Apriori is generally considered an unsupervised 
learning approach, since it's often used to discover or mine for interesting patterns and 
relationships. 

But wait, there's more... 

Apriori can also be modified to do classification based on labelled data.

Why use Apriori? Apriori is well understood, easy to implement and has many 
derivatives. 

On the other hand... 

The algorithm can be quite memory, space and time intensive when generating itemsets. 

Where is it used? Plenty of implementations of Apriori are available. Some popular ones 
are the ARtool, Weka, and Orange. 


5. EM 


What does it do? In data mining, expectation-maximization (EM) is generally used as a 
clustering algorithm (like k-means) for knowledge discovery. 

In statistics, the EM algorithm iterates and optimizes the likelihood of seeing observed 
data while estimating the parameters of a statistical model with unobserved variables. 

OK, hang on while I explain... 

I’m not a statistician, so hopefully my simplification is both correct and helps with 
understanding. 

Here are a few concepts that will make this way easier... 

What’s a statistical model? I see a model as something that describes how observed data 
is generated. For example, the grades for an exam could fit a bell curve, so the assumption 
that the grades are generated via a bell curve (a.k.a. normal distribution) is the model. 

Wait, what's a distribution? A distribution represents the probabilities for all measurable 
outcomes. For example, the grades for an exam could fit a normal distribution. This normal 
distribution represents all the probabilities of a grade. 

In other words, given a grade, you can use the distribution to determine how many exam 
takers are expected to get that grade. 

Cool, what are the parameters of a model? A parameter describes a distribution which is 
part of a model. For example, a bell curve can be described by its mean and variance. 

Using the exam scenario, the distribution of grades on an exam (the measurable 
outcomes) followed a bell curve (this is the distribution). The mean was 85 and the variance 
was 100. 

So, all you need to describe a normal distribution are 2 parameters: the mean and the
variance.

And likelihood? Going back to our previous bell curve example... suppose we have a 
bunch of grades and are told the grades follow a bell curve. However, we’re not given all the 
grades... only a sample. 

Here’s the deal:

We don’t know the mean or variance of all the grades, but we can estimate them using
the sample. The likelihood is the probability that the bell curve with the estimated mean and
variance results in that bunch of grades.

In other words, given a set of measurable outcomes, let’s estimate the parameters. Using 
these estimated parameters, the hypothetical probability of the outcomes is called likelihood. 

Remember, it’s the hypothetical probability of the existing grades, not the probability of 
a future grade. 
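
Here is a small sketch of that idea in Python. The five grades are a made-up sample (picked so the estimates land on the mean of 85 and variance of 100 used earlier), and the function names are only illustrative. It estimates the parameters from the sample and then scores how well a bell curve with given parameters explains those same grades, using the log of the likelihood to keep the numbers manageable.

```python
import math

grades = [70, 80, 85, 90, 100]  # hypothetical sample of exam grades


def estimate_parameters(sample):
    """Estimate the mean and variance of a bell curve from a sample."""
    mean = sum(sample) / len(sample)
    variance = sum((x - mean) ** 2 for x in sample) / len(sample)
    return mean, variance


def log_likelihood(sample, mean, variance):
    """Log of the probability density of the sample under a normal distribution."""
    return sum(
        -0.5 * math.log(2 * math.pi * variance) - (x - mean) ** 2 / (2 * variance)
        for x in sample
    )


mean, variance = estimate_parameters(grades)
print(mean, variance)
print(log_likelihood(grades, mean, variance))  # estimated parameters
print(log_likelihood(grades, 60.0, variance))  # a worse guess at the mean
```

The estimated parameters give a higher log-likelihood than the worse guess, which is exactly what "optimizing the likelihood" is after.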

You’re probably wondering, what’s probability then? 

Using the bell curve example, suppose we know the mean and variance. Then we’re told 
the grades follow a bell curve. The chance that we observe certain grades and how often they 
are observed is the probability. 

In more general terms, given the parameters, let’s estimate what outcomes should be 
observed. That’s what probability does for us. 

Great! Now, what’s the difference between observed and unobserved data? Observed
data is the data that you saw or recorded. Unobserved data is data that is missing. There are a
number of reasons that the data could be missing (not recorded, ignored, etc.).

Here’s the kicker: 

For data mining and clustering, what’s important to us is looking at the class of a data 
point as missing data. We don’t know the class, so interpreting missing data this way is 
crucial for applying EM to the task of clustering. 

Once again: The EM algorithm iterates and optimizes the likelihood of seeing observed 
data while estimating the parameters of a statistical model with unobserved variables. 
Hopefully, this is way more understandable now. 

The best part is... 

By optimizing the likelihood, EM generates an awesome model that assigns class labels 
to data points—sounds like clustering to me! 

How does EM help with clustering? EM begins by making a guess at the model 
parameters. 

Then it follows an iterative 3-step process: 

(1) E-step: Based on the model parameters, it calculates the probabilities for assignments 
of each data point to a cluster. 

(2) M-step: Update the model parameters based on the cluster assignments from the 
E-step. 

(3) Repeat until the model parameters and cluster assignments stabilize (a.k.a. 
convergence). 
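
The loop above can be sketched for the simplest case: one-dimensional data and two bell-curve clusters. Everything specific here is an assumption made for illustration, including the data, the starting guess, the fixed count of 50 iterations used in place of a real convergence check, and the function names.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_two_clusters(data, iterations=50):
    # Initial guess at the model parameters.
    means = [min(data), max(data)]
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: probability that each point belongs to each cluster.
        resp = []
        for x in data:
            p = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: update the parameters from those soft assignments.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            spread = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(spread, 1e-6)  # keep the variance away from zero
            weights[k] = nk / len(data)
    labels = [0 if r[0] >= r[1] else 1 for r in resp]
    return means, variances, weights, labels


data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
means, variances, weights, labels = em_two_clusters(data)
print([round(m, 2) for m in means])
print(labels)
```

No class labels go in, yet each point comes out assigned to a cluster, along with a mean and variance describing that cluster.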

Is this supervised or unsupervised? Since we do not provide labeled class information,
this is unsupervised learning. 

Why use EM? A key selling point of EM is that it's simple and straightforward to
implement. In addition, not only can it optimize for model parameters, it can also iteratively
make guesses about missing data.

This makes it great for clustering and generating a model with parameters. Knowing the 
clusters and model parameters, it's possible to reason about what the clusters have in common 
and which cluster new data belongs to. 

EM is not without weaknesses though... 

(1) First, EM is fast in the early iterations, but slow in the later iterations. 

(2) Second, EM doesn't always find the optimal parameters and gets stuck in local 
optima rather than global optima. 

Where is it used? The EM algorithm is available in Weka. R has an implementation in
the mclust package. Scikit-learn also has an implementation in its mixture module (the
GaussianMixture class; older releases exposed it as GMM).


6. PageRank 


PageRank is a link analysis algorithm designed to determine the relative importance of 
some object linked within a network of objects. 

Yikes. What's link analysis? It’s a type of network analysis looking to explore the 
associations (a.k.a. links) among objects. 

Here's an example: The most prevalent example of PageRank is Google's search engine. 
Although their search engine doesn't solely rely on PageRank, it's one of the measures 
Google uses to determine a web page's importance. 

Let me explain: 

Web pages on the World Wide Web link to each other. If rayli.net links to a web page on 
CNN, a vote is added for the CNN page indicating rayli.net finds the CNN web page relevant. 

And it doesn't stop there... 

rayli.net's votes are in turn weighted by rayli.net's importance and relevance. In other 
words, any web page that's voted for rayli.net increases rayli.net's relevance. 

The bottom line? 

This concept of voting and relevance is PageRank. rayli.net's vote for CNN increases
CNN's PageRank, and the strength of rayli.net's PageRank influences how much its vote
affects CNN's PageRank.
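
The voting idea can be sketched as a short loop that keeps passing rank along links until the numbers settle. This is a simplified illustration, not Google's production system: the three-page link graph is invented, and the damping factor of 0.85 and the 100 iterations are conventional choices that the text above does not specify.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Repeatedly pass rank along links; links maps a page to the pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its vote over every page.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph, for illustration only.
links = {
    "rayli.net": ["cnn.com"],
    "blog.example": ["cnn.com", "rayli.net"],
    "cnn.com": ["blog.example"],
}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

In this toy graph cnn.com ends up on top because it receives votes from both other pages, and the weight of each vote depends on the voter's own rank.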

What does a PageRank of 0, 1, 2, 3, etc. mean? Although the precise meaning of a 
PageRank number isn't disclosed by Google, we can get a sense of its relative meaning. 

You see? 


It’s a bit like a popularity contest. We all have a sense of which websites are relevant and 
popular in our minds. PageRank is just an elegant way to define it. 

What other applications are there of PageRank? PageRank was specifically designed for 
the World Wide Web. 

Think about it: 

At its core, PageRank is really just a super effective way to do link analysis. The objects 
being linked don’t have to be web pages. 

Is this supervised or unsupervised? PageRank is generally considered an unsupervised 
learning approach, since it’s often used to discover the importance or relevance of a web page. 

Why use PageRank? Arguably, the main selling point of PageRank is its robustness due 
to the difficulty of getting a relevant incoming link. 

Simply stated: 

If you have a graph or network and want to understand relative importance, priority, 
ranking or relevance, give PageRank a try. 

Where is it used? The PageRank trademark is owned by Google. However, the PageRank 
algorithm is actually patented by Stanford University. 

EM (Expectation-Maximization): the expectation-maximization algorithm


Exercises


[Ex 5] Answer the following questions according to the text.

1. How does C4.5 construct a classifier? 

2. What is a classifier? 

3. What does k-means do? 

4. What's cluster analysis? 

5. What is the key selling point of k-means? What does it mean? 

6. What does SVM do at a high-level? 

7. What is the margin? 

8. What are a few of the popular implementations of SVM? 

9. What does the Apriori algorithm do? 

10. Why is Apriori generally considered an unsupervised learning approach? 
11. What is expectation-maximization (EM) generally used as in data mining? 
12. What does the EM algorithm do in statistics? 

13. What's the difference between observed and unobserved data? 

14. What is PageRank? 

15. What's link analysis? 


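Reference Translation


Data Mining
Data mining is a powerful new technology with great potential to help companies focus on
the most important information in the behavioral data they collect about their customers and
potential customers. It discovers information in the data that queries and reports cannot
effectively reveal.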


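1. What is data mining?


Data mining, or knowledge discovery, is a computer-assisted process of digging through and
analyzing enormous sets of data and then extracting the meaning of the data. Data mining
tools predict behaviors and future trends, allowing businesses to make proactive,
knowledge-driven decisions. They can answer business questions that traditionally took too
much time to resolve. They scour databases for hidden patterns, finding predictive
information that experts may miss because it lies outside their expectations.

Data mining takes its name from the similarity between searching for valuable information
in a large database and mining a mountain for valuable ore. Both processes require either
sifting through an immense amount of material or intelligently probing it to find where the
value lies.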


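2. What can data mining do?


Although data mining is still in its infancy, companies in a wide range of industries,
including retail, finance, health care, manufacturing, transportation and aerospace, are
already using data mining tools and techniques to take advantage of historical data. By using
pattern recognition technologies and statistical and mathematical techniques to sift through
warehoused information, data mining helps analysts recognize significant facts,
relationships, trends, patterns, exceptions and anomalies that might otherwise go unnoticed.

For businesses, data mining is used to discover patterns and relationships in the data in
order to help make better business decisions. Data mining can help spot sales trends, develop
smarter marketing campaigns, and accurately predict customer loyalty. Specific uses of data
mining include:

* Market segmentation: identify the common characteristics of customers who buy the same
products from your company.

* Customer churn: predict which customers are likely to leave your company and go to a
competitor.

* Fraud detection: identify which transactions are most likely to be fraudulent.

* Direct marketing: identify which prospects should be included in a mailing list to obtain
the highest response rate.

* Interactive marketing: predict what each individual visiting a web site is most likely to
be interested in seeing.

* Market basket analysis: understand what products or services are commonly purchased
together, for example beer and diapers.

* Trend analysis: reveal the difference between a typical customer this month and last.

Data mining technology can generate new business opportunities in the following ways:

Automated prediction of trends and behaviors: Data mining automates the process of finding
predictive information in a large database. Questions that traditionally required extensive
hands-on analysis can now be answered directly from the data. A typical example of a
predictive problem is targeted marketing. Data mining uses data on past promotional mailings
to identify the targets most likely to maximize the return on future mailings. Other
predictive problems include forecasting bankruptcy and other forms of default, and
identifying segments of a population likely to respond similarly to given events.

Automated discovery of previously unknown patterns: Data mining tools sweep through
databases and identify previously hidden patterns. An example of pattern discovery is the
analysis of retail sales data to identify seemingly unrelated products that are often
purchased together. Other pattern discovery problems include detecting fraudulent credit
card transactions and identifying anomalous data that could result from data entry errors.

Using massively parallel computers, companies dig through volumes of data to discover
patterns about their customers and products. For example, grocery chains have found that
when men go to a supermarket to buy diapers, they sometimes walk out with a six-pack of beer
as well. Using that information, it is possible to rearrange the store so that these items
are closer together.

AT&T, AFLAC, Nielsen and American Express are pioneering the use of data mining techniques
in sales and marketing. These systems crunch through gigabytes of point-of-sale data to help
analysts understand consumer behavior and plan promotional strategies. Why? To gain a
competitive advantage and increase profitability!

Similarly, financial analysts plow through vast sets of financial records, data feeds and
other information sources in order to make investment decisions. Health care organizations
are examining medical records to understand past trends so that they can reduce costs in the
future.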


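3. How does data mining work?


How is data mining able to tell you things you did not know, or what is going to happen
next? The technique it uses is called modeling. Modeling is simply the act of building a
model (a set of examples or a mathematical relationship) based on data from situations where
the answer is known, and then applying the model to other situations where the answers are
not known. Modeling techniques have been around for centuries, of course, but only recently
have the data storage and communication capabilities needed to collect and store huge
amounts of data, and the computational power needed to automate modeling techniques so that
they work directly on the data, become available.

Suppose the marketing director of a telecommunications company wants to build a model. He
would like to focus his marketing and sales efforts on the people most likely to become heavy
users of long-distance services. He knows a lot about his customers, but he cannot discern
the common characteristics of his best customers because there are so many variables. From
the existing customer database, which contains information such as age, sex, credit history,
income, zip code and occupation, he can use data mining tools such as neural networks to
identify the characteristics of customers who make lots of long-distance calls. For
instance, he might learn that his best customers are unmarried women between the ages of 34
and 42 who earn more than $60,000 per year. This, then, is his model for high-value
customers, and he will adjust his marketing budget accordingly.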


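4. Data mining technologies


The analytical techniques used in data mining are well-known mathematical algorithms and
techniques. What is new is the application of those techniques to general business problems,
made possible by the increased availability of data, inexpensive storage and greater
processing power.

In addition, graphical interfaces have made the tools easy for business experts to use.

Some of the tools used for data mining are:

Artificial neural networks: non-linear predictive models that learn through training and
resemble biological neural networks in structure.

Decision trees: tree-shaped structures that represent sets of decisions. They generate
decision rules for classifying a dataset.

Rule induction: the extraction of useful "if-then" rules from data based on statistical
significance.

Genetic algorithms: optimization techniques based on the concepts of genetic combination,
mutation and natural selection.

Nearest neighbor: a classification technique that classifies each record according to the
records most similar to it in a historical database.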


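5. Real-world examples


Details about who calls whom, how long they are on the phone, and whether a line is used for
fax as well as voice can be invaluable in targeting sales of services and equipment to
specific customers. But these details are buried in masses of numbers in the database. By
delving into the extensive customer-call database it keeps to manage its communications
network, a regional telephone company identified new types of unmet customer needs. Using
its data mining system, it discovered how to pinpoint prospects for additional services by
measuring daily household telephone usage for selected periods. For example, in households
that make many lengthy calls between 3 p.m. and 6 p.m., it is probably mainly the teenagers
who are on the phone. When the company used target marketing that emphasized convenience and
value for adults ("Is the phone always busy?"), hidden demand surfaced. Telephone use
between 9 a.m. and 5 p.m. that is characterized by voice, fax and modem patterns suggests
that a customer is conducting business. Target marketing that offered those customers
"business communications capabilities for small budgets" resulted in sales of additional
lines, functions and equipment.

The ability to accurately gauge customer response to changes in business rules is a powerful
competitive advantage. A bank searching for new ways to increase revenue from its credit
card operations tested a non-intuitive possibility: would credit card usage and interest
earned increase significantly if the bank halved its minimum required payment? With hundreds
of gigabytes of data representing two years of average credit card balances, payment
amounts, payment timeliness, credit limit usage and other key parameters, the bank used a
powerful data mining system to model the impact of the proposed policy change on specific
customer categories. The bank discovered that cutting minimum payment requirements for
small, targeted customer categories could increase average balances and extend indebtedness
periods, generating more than $25 million in additional interest. Merck-Medco Managed Care
is a mail-order business that sells drugs to the country's largest health care providers.
Medco is mining its one-terabyte (TB) data warehouse to uncover hidden links between
illnesses and known drug treatments, and to help determine which drugs are most effective for
which types of patients. The result is more effective treatment that is also less costly.
Merck-Medco's data mining project has helped customers save an average of 10% to 15% on
prescription costs.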
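

6. The future of data mining


In the short term, the results of data mining will be in profitable, if mundane,
business-related areas. Micro-marketing campaigns will explore new niches. Advertising will
target potential customers with new precision.

In the medium term, data mining may be as common and easy to use as e-mail. We may use
these tools to find the best airfare to New York, dig up the phone number of a long-lost
classmate, or find the best price on a lawn mower.

The long-term prospects are truly exciting. Imagine intelligent agents turned loose on
medical research data or on subatomic particle data. Computers may reveal new treatments for
diseases.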


Text A


What Is Hadoop? 


Everyone’s talking about Hadoop, the hot new technology that’s highly prized among 
developers and just might change the world (again). But just what is it, anyway? Is it a 
programming language? A database? A processing system? An Indian tea cozy? 

The broad answer: Hadoop is all of these things (except the tea cozy), and more. It’s a 
software library that provides a programming framework for cheap, useful processing of 
another modern buzzword: big data. 


1. Where did Hadoop come from? 


Apache Hadoop is part of the Foundation Project from the Apache Software Foundation,
a non-profit organization whose mission is to “provide software for the public good.” As such,
the Hadoop library is free, open-source software available to all developers.

The underlying technology that powers Hadoop was actually invented by Google. Back 
in the early days, the not-quite-giant search engine needed a way to index the massive 
amounts of data they were collecting from the Internet, and turn it into meaningful, relevant 
results for its users. With nothing available on the market that could meet their requirements, 
Google built their own platform. 

Those innovations were released in an open-source project called Nutch, which Hadoop 
later used as a foundation. Essentially, Hadoop applies the power of Google to big data in a 
way that’s affordable for companies of all sizes. 


2. How is Hadoop different from past techniques?


Hadoop is more than just a faster, cheaper database and analytics tool. Unlike databases, 
Hadoop doesn't insist that you structure your data. Data may be unstructured and schemaless. 
Users can dump their data into the framework without needing to reformat it. By contrast, 
relational databases require that data be structured and schemas be defined before storing the 
data. 

Hadoop's simplified programming model allows users to quickly write and test software
in distributed systems. Performing computation on large volumes of data has been done
before, usually in a distributed setting, but writing software for distributed systems is
notoriously hard. By trading away some programming flexibility, Hadoop makes it much
easier to write distributed programs.

Because Hadoop accepts practically any kind of data, it stores information in far more 
diverse formats than what is typically found in the tidy rows and columns of a traditional 
database. Some good examples are machine-generated data and log data, written out in 
storage formats including JSON, Avro and ORC. 

The majority of data preparation work in Hadoop is currently being done by writing code 
in scripting languages like Hive, Pig or Python. 


Hadoop is easy to administer. 


Alternative high performance computing (HPC) systems allow programs to run on large
collections of computers, but they typically require rigid program configuration and generally
require that data be stored on a separate storage area network (SAN) system. Schedulers on
HPC clusters require careful administration, and program execution is sensitive to node
failure. By comparison, administration of a Hadoop cluster is much easier.

Hadoop invisibly handles job control issues such as node failure. If a node fails, Hadoop 
makes sure the computations are run on other nodes and that data stored on that node are 
recovered from other nodes. 


Hadoop is agile. 


Relational databases are good at storing and processing data sets with predefined and 
rigid data models. For unstructured data, relational databases lack the agility and scalability 
that are needed. Apache Hadoop makes it possible to cheaply process and analyze huge 
amounts of both structured and unstructured data together, and to process data without 
defining all structure ahead of time. 


3. Why use Apache Hadoop?


Apache Hadoop controls costs by storing data more affordably per terabyte than other
platforms. Instead of thousands to tens of thousands of dollars per terabyte, Hadoop delivers
compute and storage for hundreds of dollars per terabyte.

Fault tolerance is one of the most important advantages of using Hadoop. Even if 
individual nodes experience high rates of failure when running jobs on a large cluster, data is 
replicated across a cluster so that it can be recovered easily in the face of disk, node or rack 
failures. 
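
The effect of replication can be shown with a toy simulation. This sketch makes several assumptions that are not in the text: three copies of each block, a simple round-robin placement, and invented node and block names. It is not Hadoop's actual placement policy; it only shows why keeping copies on different machines lets data survive node failures.

```python
def place_replicas(blocks, nodes, replication=3):
    """Assign each block to `replication` different nodes, round-robin."""
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement


def readable_blocks(placement, failed_nodes):
    """Blocks that still have at least one copy on a healthy node."""
    return {
        block
        for block, holders in placement.items()
        if any(node not in failed_nodes for node in holders)
    }


nodes = ["node0", "node1", "node2", "node3"]          # hypothetical cluster
blocks = ["block-a", "block-b", "block-c", "block-d", "block-e"]
placement = place_replicas(blocks, nodes)

print(sorted(readable_blocks(placement, {"node1", "node2"})))
print(sorted(readable_blocks(placement, {"node0", "node1", "node2"})))
```

With three copies per block, losing two nodes leaves every block readable; only when all three holders of a block fail together is that block lost.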


It’s flexible. 


The flexible way that data is stored in Apache Hadoop is one of its biggest assets — 
enabling businesses to generate value from data that was previously considered too expensive 
to be stored and processed in traditional databases. With Hadoop, you can use all types of data, 
both structured and unstructured, to extract more meaningful business insights from more of 
your data. 


It’s scalable. 


Hadoop is a highly scalable storage platform, because it can store and distribute very 
large data sets across clusters of hundreds of inexpensive servers operating in parallel. The 
problem with traditional relational database management systems (RDBMS) is that they can’t 
scale to process massive volumes of data. 


4. How does Hadoop work? 


As mentioned previously, Hadoop isn’t one thing—it’s many things. Hadoop is a 
software library, which consists of four primary parts (modules), and a number of add-on 
solutions (like databases and programming languages) that enhance its real-world use. The 
four modules are: 

* Hadoop Common: This is the collection of common utilities (the common library) that
supports Hadoop modules.

* Hadoop Distributed File System (HDFS): A robust distributed file system with no 
restrictions on stored data (meaning that data can be either structured or unstructured 
and schemaless, where many DFSs will only store structured data) that provides 
high-throughput access with redundancy (HDFS allows data to be stored on multiple 
machines — so if one machine fails, availability is maintained through the other 
machines). 

* Hadoop YARN: This framework is responsible for job scheduling and cluster resource
management; it makes sure the data is spread out sufficiently over multiple machines
to maintain redundancy. YARN is the module that makes Hadoop an affordable and
cost-efficient way to process big data.

* Hadoop MapReduce: This YARN-based system, built on Google technology, carries
out parallel processing of large data sets (structured and unstructured). MapReduce
can also be found in most of today's big data processing frameworks, including MPP
and NoSQL databases.

All of these modules working together generate distributed processing for large data sets.
The Hadoop framework uses simple programming models that are replicated across clusters
of computers, meaning the system can scale up from single servers to thousands of machines
for increased processing power, rather than relying on hardware alone.

Hardware that can handle the amount of processing power required to work with big data
is expensive, to put it mildly. This is the true innovation of Hadoop: the ability to break down
massive amounts of processing power across multiple, smaller machines, each with its own
localized computation and storage, along with built-in redundancy at the application level to
prevent failures.
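
To get a feel for the MapReduce programming model, here is a single-machine sketch of the classic word count. It is not the Hadoop API and nothing in it is distributed; the function names and sample documents are made up. It only mirrors the three phases: map each input into key-value pairs, shuffle the pairs into groups by key, and reduce each group to a result.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine each key's values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


documents = ["big data needs big clusters", "Hadoop processes big data"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle_phase(mapped))
print(counts)
```

Because each document is mapped independently and each key is reduced independently, the same program shape can be spread over many machines, which is the property Hadoop exploits.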


New Words

developer [dɪˈveləpə] n. a person or company that develops software
buzzword [ˈbʌzwɜːd] n. a fashionable term; a piece of jargon
mission [ˈmɪʃən] n. a mission; a task
meaningful [ˈmiːnɪŋfʊl] adj. having meaning or purpose; significant
platform [ˈplætfɔːm] n. a platform
essentially [ɪˈsenʃəli] adv. in essence; fundamentally
insist [ɪnˈsɪst] v. to insist; to stress
structure [ˈstrʌktʃə] vt. to construct; to organize  n. structure; construction
dump [dʌmp] v. to dump; to transfer in bulk  n. a dump; a storage heap
notorious [nəʊˈtɔːrɪəs] adj. widely known; infamous
diverse [daɪˈvɜːs] adj. different; varied
rigid [ˈrɪdʒɪd] adj. stiff; inflexible
scheduler [ˈʃedjuːlə] n. a scheduling program
agile [ˈædʒaɪl] adj. nimble; quick; flexible
predefine [ˌpriːdɪˈfaɪn] vt. to determine or define in advance
agility [əˈdʒɪlɪti] n. nimbleness; quickness
scalability [ˌskeɪləˈbɪlɪti] n. the ability to be extended or scaled
terabyte [ˈterəbaɪt] n. terabyte (TB); 1 TB = 1024 GB = 2^40 bytes
replicate [ˈreplɪkeɪt] v. to copy
scalable [ˈskeɪləbl] adj. able to be scaled or upgraded
utility [juːˈtɪlɪti] n. usefulness; practicality
restriction [rɪˈstrɪkʃən] n. a limit; a constraint
throughput [ˈθruːpʊt] n. throughput; the amount processed in a given time
availability [əˌveɪləˈbɪlɪti] n. availability; usefulness
localize [ˈləʊkəlaɪz] v. to make local; to adapt to a locality


Phrases

programming language: a formal language for writing programs
tea cozy: an insulating cover for a teapot
software library: a collection of reusable software
Apache Software Foundation: abbreviated as ASF
non-profit organization: an organization not run for profit
free, open-source software: software that costs nothing and whose source code is open
distributed system: a system whose parts run on multiple networked computers
machine-generated data: data produced by machines
log data: data recorded in logs
data preparation: getting data ready for use
scripting language: a language for writing scripts
node failure: the failure or breakdown of a node
fault tolerance: the ability to keep working despite faults
job scheduling: deciding when and where jobs run
cluster resource management: managing the resources of a cluster
spread out: to disperse; to unfold
carry out: to complete; to accomplish; to execute
parallel processing: processing several things at the same time
scale up: to increase in proportion
put it mildly: to say it gently; to understate


Abbreviations

HPC (High Performance Computing)
SAN (Storage Area Network)
RDBMS (Relational Database Management System)
HDFS (Hadoop Distributed File System)
DFS (Distributed File System)
MPP (Massively Parallel Processing)

Notes


[1] Everyone’s talking about Hadoop, the hot new technology that’s highly prized among
developers and just might change the world (again).
In this sentence, "the hot new technology that’s highly prized among developers and just
might change the world (again)" is a noun phrase that adds information about Hadoop.
Within that phrase, "that’s highly prized among developers and just might change the world
(again)" is an attributive clause that modifies and restricts "the hot new technology".

[2] Those innovations were released in an open-source project called Nutch, which Hadoop
later used as a foundation.
In this sentence, "called Nutch" is a past participle phrase used as an attributive; it
modifies and restricts "an open-source project". "which Hadoop later used as a foundation"
is a non-restrictive attributive clause that adds information about "an open-source project
called Nutch".

[3] Even if individual nodes experience high rates of failure when running jobs on a large
cluster, data is replicated across a cluster so that it can be recovered easily in the face of
disk, node or rack failures.
In this sentence, "Even if individual nodes experience high rates of failure when running
jobs on a large cluster" is an adverbial clause of concession that modifies the main-clause
predicate "is replicated". Within that clause, "when running jobs on a large cluster" is an
adverbial clause of time that modifies the clause's predicate "experience". "so that it can
be recovered easily in the face of disk, node or rack failures" is an adverbial clause of
purpose that modifies the main-clause predicate "is replicated".

[4] Hadoop is a software library, which consists of four primary parts (modules), and a
number of add-on solutions (like databases and programming languages) that enhance its
real-world use.
In this sentence, "which consists of four primary parts (modules), and a number of add-on
solutions (like databases and programming languages) that enhance its real-world use" is a
non-restrictive attributive clause that adds information about "a software library". "that
enhance its real-world use" is an attributive clause that modifies and restricts "four
primary parts (modules), and a number of add-on solutions". "(like databases and
programming languages)" gives examples of "add-on solutions".

[5] The Hadoop framework uses simple programming models that are replicated across
clusters of computers, meaning the system can scale up from single servers to thousands
of machines for increased processing power, rather than relying on hardware alone.
In this sentence, "that are replicated across clusters of computers" is an attributive
clause that modifies and restricts "simple programming models". "meaning the system can
scale up from single servers to thousands of machines for increased processing power,
rather than relying on hardware alone" explains the whole preceding sentence and can be
expanded into a non-restrictive attributive clause: "which means the system can scale up
from single servers to thousands of machines for increased processing power, rather than
relying on hardware alone."


Exercises


[Ex 1] Answer the following questions according to the text.

1. What is Apache Hadoop? 

2. What did the not-quite-giant search engine need back in the early days? 

3. What does Hadoop’s simplified programming model allow users to do? 

4. Why does Hadoop store information in far more diverse formats than what is typically 
found in the tidy rows and columns of a traditional database? 

5. What does Hadoop do if a node fails? 

6. What is one of the most important advantages of using Hadoop? 

7. Why is Hadoop a highly scalable storage platform?

8. What is the problem with traditional relational database management systems (RDBMS)? 
9. What are the four primary modules that Hadoop consists of? 

10. What is the true innovation of Hadoop? 


[Ex 2] Translate the following sentences into Chinese.

1. Using this method, each developer can provide their own physical path definition to this
variable.

2. All the data is then dumped into the main computer.

3. All of the configuration and code is already implemented in the sample.

4. This is about the simplest weightless thread scheduler you could choose.

5. Such models align with agile thinking.

6. This can result in a variety of scalability and maintenance problems.

7. This allows the storage nodes to replicate data when a device is found to have failed.

8. Scalable bandwidth provides the solution while offering a more efficient use of network
resources.

9. Redundancy and dependability give the cloud another edge. 

10. The most fundamental reason for a software company to localize product is to increase 
total revenue and net income. 
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[Ex 3] 短文 翻译 。 


1. What are the challenges of using Hadoop? 


MapReduce programming is not a good match for all problems. It’s good for simple 
information requests and problems that can be divided into independent units, but it’s not 
efficient for iterative and interactive analytic tasks. MapReduce is file-intensive. Because the 
nodes don’t intercommunicate except through sorts and shuffles, iterative algorithms require 
multiple map-shuffle/sort-reduce phases to complete. This creates multiple files between 
MapReduce phases and is inefficient for advanced analytic computing. 

There’s a widely acknowledged talent gap. It can be difficult to find entry-level 
programmers who have sufficient Java skills to be productive with MapReduce. That’s one 
reason distribution providers are racing to put relational (SQL) technology on top of Hadoop. 
It is much easier to find programmers with SQL skills than MapReduce skills. And, Hadoop 
administration seems part art and part science, requiring low-level knowledge of operating 
systems, hardware and Hadoop kernel settings. 

Data security. Another challenge centers around the fragmented data security issues, 
though new tools and technologies are surfacing. The Kerberos authentication protocol is a 
great step toward making Hadoop environments secure. 

Full-fledged data management and governance. Hadoop does not have easy-to-use, 
full-feature tools for data management, data cleansing, governance and metadata. Especially 
lacking are tools for data quality and standardization. 


2. Why is Hadoop important? 


Ability to store and process huge amounts of any kind of data, quickly. With data 
volumes and varieties constantly increasing, especially from social media and the Internet of 
Things (IoT), that’s a key consideration. 

Computing power. Hadoop’s distributed computing model processes big data fast. The 
more computing nodes you use, the more processing power you have. 

Fault tolerance. Data and application processing are protected against hardware failure. If 
a node goes down, jobs are automatically redirected to other nodes to make sure the 
distributed computing does not fail. Multiple copies of all data are stored automatically. 

Flexibility. Unlike traditional relational databases, you don’t have to preprocess data 
before storing it. You can store as much data as you want and decide how to use it later. That 
includes unstructured data like text, images and videos. 

Low cost. The open-source framework is free and uses commodity hardware to store
large quantities of data.

Scalability. You can easily grow your system to handle more data simply by adding
nodes. Little administration is required.


[Ex. 4] Fill in the blanks with the words given below (use each word only once).


bottom    duplicate    machines    special    node
source    collected    operations    completion    individual


1. Background of Hadoop 


With the growing penetration and usage of the Internet, the data captured by Google
increased exponentially year on year. To give an estimate of this number, in 2007 Google
___(1)___ on average 270 PB of data every month. The same number increased to 20000 PB
every day in 2009. Obviously, Google needed a better platform to process such an enormous
amount of data. Google implemented a programming model called MapReduce, which could
process this 20000 PB per day. Google ran these MapReduce operations on a ___(2)___ file
system called Google File System (GFS). Sadly, GFS is not open source.

Doug Cutting and Yahoo! reverse engineered the GFS model and built a parallel Hadoop
Distributed File System (HDFS). The software or framework that supports HDFS and
MapReduce is known as Hadoop. Hadoop is open ___(3)___ and distributed by Apache.


2. Framework of Hadoop Processing 


Let's draw an analogy from daily life to understand how Hadoop works. At the bottom of
the pyramid of any firm are the people who are ___(4)___ contributors. They can be
analysts, programmers, manual laborers, chefs, etc. Managing their work is the project
manager. The project manager is responsible for the successful ___(5)___ of the task. He
needs to distribute labor, smooth the coordination among the contributors, and so on. Also,
most of these firms have a people manager, who is more concerned about retaining the head
count.

Hadoop works in a similar way. On the ___(6)___ we have machines arranged in parallel.
These machines are analogous to the individual contributors in our analogy. Every machine
has a data node and a task tracker. The data node is also known as HDFS (Hadoop
Distributed File System) and the task tracker is also known as the map-reducer.

The data node contains the entire set of data and the task tracker does all the ___(7)___.
You can imagine the task tracker as your arms and legs, which enable you to do a task, and
the data node as your brain, which contains all the information you want to process. These
___(8)___ work in silos, and it is essential to coordinate them. The task trackers (the project
managers in our analogy) in different machines are coordinated by a job tracker. The job
tracker makes sure that each operation is completed, and if there is a process failure at any
___(9)___, it needs to assign a duplicate task to some task tracker. The job tracker also
distributes the entire task to all the machines.

A name node, on the other hand, coordinates all the data nodes. It governs the distribution
of data going to each machine. It also checks for any kind of purging that has happened on
any machine. If such purging happens, it finds the ___(10)___ data which was sent to another
data node and duplicates it again. You can think of this name node as the people manager in
our analogy, which is more concerned about the retention of the entire dataset.


Text B 


Apache Spark 


Apache Spark is an open source parallel processing framework for running large-scale 
data analytics applications across clustered computers. It can handle both batch and real-time 
analytics and data processing workloads. 

Spark became a top-level project of the Apache Software Foundation in February 2014, 
and version 1.0 of Apache Spark was released in May 2014. Spark version 2.0 was released in 
July 2016. 

The technology was initially designed in 2009 by researchers at the University of 
California, Berkeley as a way to speed up processing jobs in Hadoop systems. 

Spark Core, the heart of the project that provides distributed task transmission, 
scheduling and I/O functionality, provides programmers with a potentially faster and more 
flexible alternative to MapReduce. MapReduce is the software framework to which early 
versions of Hadoop were tied. Spark's developers say it can run jobs 100 times faster than 
MapReduce when processed in memory, and 10 times faster on disk. 
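
The MapReduce model mentioned above may be easier to picture with a toy example. The following is a minimal pure-Python sketch, not Hadoop or Spark code: the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical, and everything runs in a single process instead of on a cluster.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)


def shuffle_phase(pairs):
    """Shuffle/sort: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(groups):
    """Reduce: combine the values of each key into a single result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    """Chain the three phases, as one MapReduce job would."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


counts = word_count(["big data is big", "data is everywhere"])
print(counts)
```

On a real cluster each phase would run on many machines and write intermediate files between phases, which is the overhead that in-memory processing is meant to reduce.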


1. How Apache Spark works 


Apache Spark can process data from a variety of data repositories, including the Hadoop 
Distributed File System (HDFS), NoSQL databases and relational data stores, such as Apache 
Hive. Spark supports in-memory processing to boost the performance of big data analytics 
applications, but it can also perform conventional disk-based processing when data sets are
too large to fit into the available system memory.

The Spark Core engine uses the resilient distributed data set, or RDD, as its basic data
type. The RDD is designed to hide much of the computational complexity from users. It
aggregates data and partitions it across a server cluster, where it can then be computed and
either moved to a different data store or run through an analytic model. The user doesn’t
have to define where specific files are sent or what computational resources are used to
store or retrieve files.
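
To illustrate what partitioning data across a cluster means, here is a minimal sketch in plain Python. It is an assumption-laden toy, not Spark's partitioner: partition_records is a hypothetical helper, the records are integers so that the modulo rule is easy to follow, and each "partition" is just a list standing in for one server.

```python
def partition_records(records, num_partitions):
    """Assign each integer record to a partition by record % num_partitions."""
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        partitions[record % num_partitions].append(record)
    return partitions


# Ten records spread over three hypothetical servers.
partitions = partition_records(range(10), 3)

# Each partition is processed independently; the partial results are then combined.
partial_sums = [sum(part) for part in partitions]
total = sum(partial_sums)
print(partitions, partial_sums, total)
```

The caller only asks for a total; which record lands on which partition is decided by the helper, in the same spirit as the RDD hiding placement from the user.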

In addition, Spark can handle more than the batch processing applications that MapReduce
is limited to running.


2. Spark libraries 


The Spark Core engine functions partly as an application programming interface (API) 
layer and underpins a set of related tools for managing and analyzing data. Aside from the 
Spark Core processing engine, the Apache Spark API environment comes packaged with 
some libraries of code for use in data analytics applications. 


2.1 Spark Core 


Spark Core is the foundation of the overall project. It provides distributed task 
dispatching, scheduling, and basic I/O functionalities, exposed through an application 
programming interface (for Java, Python, Scala, and R) centered on the RDD abstraction (the 
Java API is available for other JVM languages, but is also usable for some other non-JVM 
languages, such as Julia, that can connect to the JVM). This interface mirrors a 
functional/higher-order model of programming: a “driver” program invokes parallel operations
such as map, filter or reduce on an RDD by passing a function to Spark, which then schedules 
the function's execution in parallel on the cluster. These operations, and additional ones such 
as joins, take RDDs as input and produce new RDDs. RDDs are immutable and their 
operations are lazy; fault-tolerance is achieved by keeping track of the “lineage” of each RDD 
(the sequence of operations that produced it) so that it can be reconstructed in the case of data 
loss. RDDs can contain any type of Python, Java, or Scala objects. 
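
The ideas of immutability, lazy operations and lineage can be sketched in a few lines of plain Python. The LazyDataset class below is a hypothetical stand-in written for illustration only; it is not the Spark RDD API and it runs on one machine.

```python
from functools import reduce as fold


class LazyDataset:
    """Toy stand-in for an RDD: immutable, lazy, and aware of its lineage."""

    def __init__(self, source, lineage=()):
        self._source = tuple(source)    # original data, never modified
        self._lineage = tuple(lineage)  # recorded (name, function) steps

    def map(self, fn):
        # Transformations do no work; they return a new dataset with a longer lineage.
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self._source, self._lineage + (("filter", fn),))

    def collect(self):
        """Action: replay the recorded lineage over the source data."""
        data = iter(self._source)
        for name, fn in self._lineage:
            data = map(fn, data) if name == "map" else filter(fn, data)
        return list(data)

    def reduce(self, fn):
        """Action: combine all elements with a two-argument function."""
        return fold(fn, self.collect())

    def lineage(self):
        return [name for name, _ in self._lineage]


numbers = LazyDataset(range(1, 6))
squares_of_evens = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(squares_of_evens.lineage(), squares_of_evens.collect())
```

Because every result can be recomputed by replaying its lineage from the source, a lost result does not need to have been copied elsewhere in order to be rebuilt.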

Besides the RDD-oriented functional style of programming, Spark provides two 
restricted forms of shared variables: broadcast variables and accumulators. Broadcast 
variables reference read-only data that needs to be available on all nodes, while accumulators 
can be used to program reductions in an imperative style.


2.2 Spark SQL 


Spark SQL is a component on top of Spark Core that introduces a data abstraction called
DataFrames, which provides support for structured and semi-structured data. Spark SQL
provides a domain-specific language (DSL) to manipulate DataFrames in Scala, Java, or
Python. It also provides SQL language support, with command-line interfaces and an
ODBC/JDBC server. Although DataFrames lack the compile-time type-checking afforded by
RDDs, as of Spark 2.0, the strongly typed Dataset is fully supported by Spark SQL as well.


2.3 Spark Streaming 


Spark Streaming uses Spark Core’s fast scheduling capability to perform streaming 
analytics. It ingests data in mini-batches and performs RDD transformations on those 
mini-batches of data. This design enables the same set of application code written for batch 
analytics to be used in streaming analytics, thus facilitating easy implementation of lambda 
architecture. However, this convenience comes with the penalty of latency equal to the 
mini-batch duration. Other streaming data engines that process event by event rather than in 
mini-batches include Storm and the streaming component of Flink. Spark Streaming has
built-in support for consuming data from Kafka, Flume, Twitter, ZeroMQ, Kinesis, and
TCP/IP sockets.
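
The mini-batch idea can be sketched without Spark. In the plain-Python example below, mini_batches and count_errors are hypothetical names, the log records are invented, and batches are cut by record count for simplicity, whereas Spark Streaming cuts them by time interval.

```python
def mini_batches(events, batch_size):
    """Group an event stream into fixed-size batches (the last one may be short)."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def count_errors(records):
    """Batch-style logic, reused unchanged for every mini-batch."""
    return sum(1 for record in records if record.startswith("ERROR"))


stream = ["INFO a", "ERROR b", "INFO c", "ERROR d", "ERROR e", "INFO f", "ERROR g"]

# Streaming use: results arrive once per mini-batch.
per_batch = [count_errors(batch) for batch in mini_batches(stream, 3)]

# Batch use: the same function over the whole data set.
overall = count_errors(stream)
print(per_batch, overall)
```

A result for a record is only available once its batch is complete, which is the latency penalty described above.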

Spark 2.x also provides Structured Streaming, a separate technology based on Datasets that
offers a higher-level interface for streaming.


2.4 MLlib (Machine Learning Library)


Spark MLlib is a distributed machine learning framework on top of Spark Core that, due
in large part to the distributed memory-based Spark architecture, is as much as nine times as
fast as the disk-based implementation used by Apache Mahout (according to benchmarks
done by the MLlib developers against the alternating least squares (ALS) implementations,
and before Mahout itself gained a Spark interface), and scales better than Vowpal Wabbit.
Many common machine learning and statistical algorithms have been implemented and are
shipped with MLlib, which simplifies large-scale machine learning pipelines, including:

* summary statistics, correlations, stratified sampling, hypothesis testing, random data
  generation
* classification and regression: support vector machines, logistic regression, linear
  regression, decision trees, naive Bayes classification
* collaborative filtering techniques including alternating least squares (ALS)
* cluster analysis methods including k-means, and latent Dirichlet allocation (LDA)
* dimensionality reduction techniques such as singular value decomposition (SVD), and
  principal component analysis (PCA)
* feature extraction and transformation functions
* optimization algorithms such as stochastic gradient descent, limited-memory BFGS
  (L-BFGS).
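
As an illustration of one item in the list, the sketch below fits a straight line by stochastic gradient descent in plain Python. It is not MLlib code: sgd_linear_regression is a hypothetical function, the data are synthetic points on the line y = 2x + 1, and the learning rate and epoch count are assumptions chosen for this small example.

```python
import random


def sgd_linear_regression(points, learning_rate=0.1, epochs=500, seed=0):
    """Fit y = w * x + b by stochastic gradient descent on the squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(points)
    for _ in range(epochs):
        rng.shuffle(data)  # visit the samples in a random order each epoch
        for x, y in data:
            error = (w * x + b) - y
            # Gradient of 0.5 * error**2 with respect to w and b.
            w -= learning_rate * error * x
            b -= learning_rate * error
    return w, b


# Synthetic, noise-free data on the line y = 2x + 1.
points = [(i / 10, 2 * (i / 10) + 1) for i in range(11)]
w, b = sgd_linear_regression(points)
print(round(w, 3), round(b, 3))
```

Each update uses a single sample, which is what makes the method attractive when the full data set is too large to process in one pass per step.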


2.5 GraphX 


GraphX is a distributed graph processing framework on top of Apache Spark. Because it 
is based on RDDs, which are immutable, graphs are immutable and thus GraphX is unsuitable 
for graphs that need to be updated, let alone in a transactional manner like a graph database. 
GraphX provides two separate APIs for implementation of massively parallel algorithms 
(such as PageRank): a Pregel abstraction, and a more general MapReduce style API. Unlike 
its predecessor Bagel, which was formally deprecated in Spark 1.6, GraphX has full support 
for property graphs (graphs where properties can be attached to edges and vertices). 
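
PageRank, mentioned above as a typical massively parallel graph algorithm, can be sketched in plain Python. The version below is a single-machine illustration, not the GraphX or Pregel API: the pagerank function, the damping factor of 0.85 and the three-node graph are assumptions made for the example.

```python
def pagerank(graph, damping=0.85, iterations=100):
    """Iteratively compute PageRank for a dict mapping node -> list of outgoing links."""
    nodes = list(graph)
    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the "random jump" share.
        new_ranks = {node: (1.0 - damping) / n for node in nodes}
        for node, links in graph.items():
            if links:
                share = damping * ranks[node] / len(links)
                for target in links:
                    new_ranks[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for target in nodes:
                    new_ranks[target] += damping * ranks[node] / n
        ranks = new_ranks
    return ranks


# Hypothetical graph: a links to b and c, b links to c, c links back to a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
print({node: round(rank, 3) for node, rank in ranks.items()})
```

Note that each iteration builds a new ranks dictionary instead of updating the old one in place, which loosely mirrors the immutable-graph design described above.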

GraphX can be viewed as being the Spark in-memory version of Apache Giraph, which 
utilized Hadoop disk-based MapReduce. 

Like Apache Spark, GraphX initially started as a research project at UC Berkeley's 
AMPLab and Databricks, and was later donated to the Apache Software Foundation and the 
Spark project. 


3. Spark languages 


Spark was written in Scala, which is considered the primary language for interacting 
with the Spark Core engine. Out of the box, Spark also comes with API connectors for using 
Java and Python. Java is not considered an optimal language for data engineering or data
science, so many users rely on Python, which is simpler and more geared toward data
analysis.

There is also an R programming package that users can download and run in Spark. This 
enables users to run the popular desktop data science language on larger distributed data sets 
in Spark and to use it to build applications that leverage machine learning algorithms. 


4. Apache Spark use cases 


The wide range of Spark libraries and its ability to compute data from many different 
types of data stores means Spark can be applied to many different problems in many 
industries. Digital advertising companies use it to maintain databases of web activity and 
design campaigns tailored to specific consumers. Financial companies use it to ingest 
financial data and run models to guide investing activity. Consumer goods companies use it to 
aggregate customer data and forecast trends to guide inventory decisions and spot new market
opportunities.

Large enterprises that work with big data applications use Spark because of its speed and
its ability to tie together multiple types of databases and to run different kinds of analytics
applications.


New Words
workload [ˈwɜːkləʊd] n. amount of work to be done
conventional [kənˈvenʃənl] adj. customary, traditional
engine [ˈendʒɪn] n. engine
resilient [rɪˈzɪliənt] adj. able to recover; springing back
partition [pɑːˈtɪʃn] n. division, separation; divider
    vt. to divide, to separate
retrieve [rɪˈtriːv] v. to fetch, to get back
underpin [ˌʌndəˈpɪn] vt. to strengthen the foundation of, to support
fault-tolerance [ˈfɔːlt ˈtɒlərəns] n. ability to keep working despite faults
lineage [ˈlɪniɪdʒ] n. ancestry, line of descent
reconstructed [ˌriːkənˈstrʌktɪd] adj. rebuilt, remade
accumulator [əˈkjuːmjəleɪtə] n. accumulator
abstraction [æbˈstrækʃn] n. abstraction, extraction
ingest [ɪnˈdʒest] vt. to take in, to swallow, to absorb
penalty [ˈpenəlti] n. penalty, cost
duration [djuˈreɪʃn] n. length of time that something lasts
pipeline [ˈpaɪplaɪn] n. pipeline, channel of transmission
correlation [ˌkɒrəˈleɪʃn] n. mutual relationship, correlation
random [ˈrændəm] adj. random
regression [rɪˈɡreʃn] n. regression
dimensionality [dɪˌmenʃəˈnæləti] n. dimensionality, number of dimensions
stochastic [stəˈkæstɪk] adj. random
unsuitable [ʌnˈsuːtəbl] adj. not suitable, inappropriate
predecessor [ˈpriːdɪsesə] n. predecessor; the thing that has been replaced
connector [kəˈnektə] n. connector
invest [ɪnˈvest] v. to invest

Phrases
open source    source code that is openly available
parallel processing    processing carried out on several units at the same time
fit into    to be suitable for, to fit within
computational complexity    the complexity of a computation
batch processing    processing data in batches
parallel operation    an operation carried out in parallel
broadcast variable    a variable sent to all nodes
imperative style    a command-based style
command-line interface    an interface driven by typed commands
lambda architecture    lambda (λ) architecture
distributed machine learning framework    a machine learning framework that runs on a cluster
large scale machine learning pipelines    machine learning pipelines for very large data sets
stratified sample    a sample drawn layer by layer
hypothesis testing    testing of a statistical hypothesis
support vector machine    support vector machine
logistic regression    logistic regression
linear regression    linear regression
decision tree    a tree of hierarchical decisions
naive Bayes classification    naive Bayes classification
dimensionality reduction techniques    techniques for reducing the number of dimensions
optimization algorithm    an algorithm for finding an optimum
stochastic gradient descent    gradient descent using random samples
limited-memory BFGS (L-BFGS)    the BFGS algorithm with limited memory
distributed graph processing framework    a graph processing framework that runs on a cluster
interact with...    to work or communicate with...
distributed data set    a data set spread over several machines
consumer goods    goods for everyday consumption


Abbreviations
HDFS (Hadoop Distributed File System)
RDD (Resilient Distributed Dataset)
API (Application Programming Interface)
I/O (Input/Output)
JVM (Java Virtual Machine)
DSL (Domain-Specific Language)
ODBC (Open Database Connectivity)
JDBC (Java Database Connectivity)
MLlib (Machine Learning Library)
ALS (Alternating Least Squares)
LDA (Latent Dirichlet Allocation)
SVD (Singular Value Decomposition)
PCA (Principal Component Analysis)
BFGS (Broyden, Fletcher, Goldfarb, Shanno)    the initials of four people's names


Exercises


[Ex. 5] Answer the following questions according to the text.

1. What is Apache Spark? 

2. What can Apache Spark do? 

3. What does the Spark Core engine use as its basic data type? 

4. What is Spark Core? 

5. What are the two restricted forms of shared variables Spark provides? 

6. What does Spark Streaming use Spark Core’s fast scheduling capability to do? 

7. What is Spark MLlib?

8. What are the two separate APIs GraphX provides for implementation of massively parallel
algorithms?

9. What was Spark written in? 

10. What do digital advertising companies and consumer goods companies use Apache Spark 
to do respectively? 


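Reference Translation


What Is Hadoop


Everyone is talking about Hadoop, a hot new technology that developers value highly and
that may (once again) change the world. But what is it? A programming language? A
database? A processing system? An Indian tea cozy?

The broad answer is that Hadoop is all of these things (except the tea cozy), and more. It is
a software library that provides a programming framework for processing big data cheaply
and usefully (big data being another modern buzzword).


1. Where Hadoop comes from


Apache Hadoop is part of the foundation project of the Apache Software Foundation, a
non-profit organization whose mission is to “provide software for the public good”. The
Hadoop library is therefore free, open-source software available to all developers.

The underlying technology of Hadoop was actually developed by Google. In its early days
the search engine was not yet a giant, and it needed a way to index the massive amounts of
data it was collecting from the Internet and turn them into relevant results for its users.
Finding nothing on the market that met its needs, Google built its own platform.

These innovations were released in an open-source project called Nutch, which was later
used as the foundation for Hadoop. Importantly, Hadoop brings Google's power to big data
and offers a solution suited to companies of all sizes.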


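2. How Hadoop differs from earlier technologies


Hadoop is more than a faster, cheaper database and analytics tool. Unlike a database,
Hadoop does not insist on structured data. Data may be unstructured and schemaless. Users
can dump their data into the framework without reformatting it. By contrast, a relational
database requires data to be structured and given a schema before it is stored.

Hadoop's simplified programming model allows users to write and test software quickly in
distributed systems. Computation over large volumes of data was possible before, but it
usually required a distributed setup, and writing software for distributed systems is very
difficult. By giving up some programming flexibility, Hadoop makes it much easier to write
distributed programs.

Because Hadoop accepts practically any type of data, it stores information in far more
diverse formats than a traditional database, where data sits in tidy rows and columns. Good
examples are machine-generated data and log data, as well as data in JSON, Avro and ORC
storage formats.

Most data preparation in Hadoop is currently done by programs written in scripting
languages such as Hive, Pig or Python.

Hadoop is easy to manage.

Alternative high-performance computing (HPC) systems allow programs to run on large
numbers of computers, but they typically require rigid program configuration and generally
need the data to be stored on a separate storage area network (SAN) system. Schedulers on
HPC clusters require careful administration, and program execution is sensitive to node
failure, so a Hadoop cluster is much easier to administer.

Hadoop silently handles job-control issues such as node failure. If a node fails, Hadoop
ensures that the computation runs on other nodes and that the data stored on the failed node
is recovered from other nodes.

Hadoop is agile.

Relational databases are good at storing and processing data sets with predefined, rigid
data models. For unstructured data, they lack the agility and scalability required. Apache
Hadoop can process and analyze large volumes of structured and unstructured data together
at low cost, without requiring every structure to be defined in advance.


3. Why use Apache Hadoop


Apache Hadoop controls costs by storing data more cheaply per terabyte than other
platforms. With Hadoop, computing and storing each terabyte of data costs hundreds of
dollars rather than thousands to tens of thousands.

Fault tolerance is one of Hadoop's most important advantages. Even if individual nodes
experience high failure rates while running jobs on a large cluster, data is replicated across
the cluster so that it can easily be recovered in the event of disk, node or rack failures.

Hadoop is flexible.

The flexible way in which data is stored in Apache Hadoop is one of its greatest assets,
enabling businesses to generate value from data that previously had to be stored and
processed in expensive traditional databases. With Hadoop, all types of structured and
unstructured data can be used, so more meaningful business insights can be extracted from
more of the data.

Hadoop is scalable.

Hadoop is a highly scalable storage platform because it can store and distribute very large
data sets across clusters of hundreds of inexpensive servers operating in parallel. The
problem with traditional relational database management systems (RDBMS) is that they
cannot scale to process massive volumes of data.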


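4. How Hadoop works


As mentioned above, Hadoop does not do just one thing; it does many. Its software library
consists of four primary parts (modules) plus a number of add-on solutions (such as
databases and programming languages) that enhance its use in practice. The four modules
are:

* Hadoop Common: the collection of common utilities (the common library) that support
  the Hadoop modules.
* Hadoop Distributed File System (HDFS): a robust distributed file system with no
  restrictions on the stored data (meaning that data can be structured or unstructured and
  schemaless, whereas many DFSs store only structured data) that provides high-throughput
  access with redundancy (HDFS allows data to be stored across multiple machines, so if one
  machine fails, work can continue on the others).
* Hadoop YARN: the framework responsible for job scheduling and cluster resource
  management; it makes sure data is spread across multiple machines to maintain
  redundancy. YARN is the module that makes Hadoop an efficient and affordable way to
  process big data.
* Hadoop MapReduce: a YARN-based system, built on Google technology, that carries out
  parallel processing of large data sets (structured and unstructured). MapReduce can also be
  found in most of today's big data processing frameworks, including MPPs and NoSQL
  databases.

All of these modules work together to provide distributed processing of large data sets.
The Hadoop framework uses simple programming models that are replicated across clusters
of computers, which means the system can scale up from single servers to thousands of
machines for increased processing power, rather than relying on hardware alone.

Hardware that can handle big data is expensive. The true innovation of Hadoop is the
ability to break massive amounts of processing power across multiple smaller machines, each
with its own localized computation and storage, with redundancy built in at the application
level to guard against failures.


Data Visualization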


Data visualization is viewed by many disciplines as a modern equivalent of visual 
communication. It involves the creation and study of the visual representation of data, 
meaning “information that has been abstracted in some schematic form, including attributes or
variables for the units of information”.

A primary goal of data visualization is to communicate information clearly and 
efficiently via statistical graphics, plots and information graphics. Numerical data may be 
encoded using dots, lines, or bars, to visually communicate a quantitative message. Effective 
visualization helps users analyze and reason about data and evidence. It makes complex data 
more accessible, understandable and usable. Users may have particular analytical tasks, such 
as making comparisons or understanding causality, and the design principle of the graphic 
(i.e., showing comparisons or showing causality) follows the task. Tables are generally used 
when users will look up a specific measurement, while charts of various types are used to 
show patterns or relationships in the data for one or more variables. 

Data visualization is both an art and a science. It is viewed as a branch of descriptive 
statistics by some, but also as a grounded theory development tool by others. There is an 
increasing amount of data created by Internet activity and an expanding number of sensors in 
the environment. Processing, analyzing and communicating this data present ethical and 
analytical challenges for data visualization. Data scientists help address these challenges.


1. Overview


Data visualization refers to the techniques used to communicate data or information by 
encoding it as visual objects (e.g., points, lines or bars) contained in graphics. The goal is to 
communicate information clearly and efficiently to users. It is one of the steps in data analysis 
or data science. According to Friedman, the main goal of data visualization is to communicate 
information clearly and effectively through graphical means. It doesn’t mean that data 
visualization needs to look boring to be functional or extremely sophisticated to look beautiful. 
To convey ideas effectively, both aesthetic form and functionality need to go hand in hand, 
providing insights into a rather sparse and complex data set by communicating its key aspects
in a more intuitive way. Yet designers often fail to achieve a balance between form and 
function, creating gorgeous data visualizations which fail to serve their main purpose — to 
communicate information. 

Indeed, Fernanda Viegas and Martin M. Wattenberg suggested that an ideal visualization 
should not only communicate clearly, but stimulate viewer engagement and attention. 

Data visualization is closely related to information graphics, information visualization, 
scientific visualization, exploratory data analysis and statistical graphics. In the new 
millennium, data visualization has become an active area of research, teaching and 
development (see Figure 11-1).

Figure 11-1 Data visualization is one of the steps in analyzing data and presenting it to users.


2. Characteristics of Effective Graphical Displays 


Professor Edward Tufte explained that users of information displays are executing
particular analytical tasks such as making comparisons or determining causality. The design
principle of the information graphic should support the analytical task, showing the
comparison or causality.

In his book The Visual Display of Quantitative Information, Edward Tufte defines 
graphical displays and principles for effective graphical display. He holds that excellence in 
statistical graphics consists of complex ideas communicated with clarity, precision and 
efficiency. Graphical displays should: 

* show the data;
* induce the viewer to think about the substance rather than about methodology, graphic
  design, the technology of graphic production or something else;
* avoid distorting what the data has to say;
* present many numbers in a small space;
* make large data sets coherent;
* encourage the eye to compare different pieces of data;
* reveal the data at several levels of detail, from a broad overview to the fine structure;
* serve a reasonably clear purpose: description, exploration, tabulation or decoration; and
* be closely integrated with the statistical and verbal descriptions of a data set.

Indeed graphics can be more precise and revealing than conventional statistical 
computations. 

Not applying these principles may result in misleading graphs, which distort the message
or support an erroneous conclusion. For example, needlessly separating the explanatory key
from the image itself requires the eye to travel back and forth from the image to the key.

The Congressional Budget Office summarized several best practices for graphical 
displays in a June 2014 presentation. These included: 

* Knowing your audience; 

* Designing graphics that can stand alone outside the context of the report; and 

* Designing graphics that communicate the key messages in the report. 


3. Quantitative messages 


Author Stephen Few describes eight types of quantitative messages that users may 
attempt to understand or communicate from a set of data and the associated graphs used to 
help communicate the message: 

* Time-series: A single variable is captured over a period of time, such as the
  unemployment rate over a 10-year period. A line chart may be used to demonstrate the
  trend (see Figure 11-2).

Source: Congressional Budget Office.
Figure 11-2 A time series illustrated with a line chart demonstrating trends in
U.S. federal spending and revenue over time.

* Ranking: Categorical subdivisions are ranked in ascending or descending order, such
  as a ranking of sales performance (the measure) by sales persons (the category, with
  each sales person a categorical subdivision) during a single period. A bar chart may be
  used to show the comparison across the sales persons.

* Part-to-whole: Categorical subdivisions are measured as a ratio to the whole (i.e., a
  percentage out of 100%). A pie chart or bar chart can show the comparison of ratios,
  such as the market share represented by competitors in a market.

* Deviation: Categorical subdivisions are compared against a reference, such as a
  comparison of actual vs. budget expenses for several departments of a business for a
  given time period. A bar chart can show comparison of the actual versus the reference
  amount.

* Frequency distribution: Shows the number of observations of a particular variable for
  given intervals, such as the number of years in which the stock market return is
  between intervals such as 0-10%, 11-20%, etc. A histogram, a type of bar chart, may
  be used for this analysis. A boxplot helps visualize key statistics about the distribution,
  such as median, quartiles, outliers, etc.

* Correlation: Comparison between observations represented by two variables (X, Y) to
  determine if they tend to move in the same or opposite directions. For example,
  plotting unemployment (X) and inflation (Y) for a sample of months. A scatter plot is
  typically used for this message (see Figure 11-3).

[Chart: U.S. Phillips Curve: Inflation vs Unemployment, 1/2000 to 8/2014. Source data:
FRED Database. Inflation: CPI for All Urban Consumers.]
Figure 11-3 A scatter plot illustrating negative correlation between two variables
(inflation and unemployment) measured at points in time.

* Nominal comparison: Comparing categorical subdivisions in no particular order, such
  as the sales volume by product code. A bar chart may be used for this comparison.

* Geographic or geospatial: Comparison of a variable across a map or layout, such as the
  unemployment rate by state or the number of persons on the various floors of a
  building. A cartogram is a typical graphic used.

Analysts reviewing a set of data may consider whether some or all of the messages and
graphic types above are applicable to their task and audience. The process of trial and error to
identify meaningful relationships and messages in the data is part of exploratory data analysis.
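
Two of the messages above, frequency distribution and correlation, rest on simple calculations that are made before anything is drawn. The plain-Python sketch below shows both; the function names and all the figures in it are invented for illustration and do not come from the charts in this text.

```python
import math


def frequency_distribution(values, bin_width):
    """Count observations per interval [start, start + bin_width)."""
    counts = {}
    for value in values:
        start = math.floor(value / bin_width) * bin_width
        counts[start] = counts.get(start, 0) + 1
    return dict(sorted(counts.items()))


def pearson_correlation(xs, ys):
    """Pearson's r: near +1 the variables move together, near -1 in opposite directions."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical annual returns in percent, grouped into 10-point intervals.
returns = [3, 7, 12, 18, 25, 4, 15]
bins = frequency_distribution(returns, 10)

# Hypothetical monthly unemployment (X) and inflation (Y) figures.
unemployment = [4, 5, 6, 7, 8]
inflation = [3.0, 2.5, 2.0, 1.5, 1.0]
r = pearson_correlation(unemployment, inflation)
print(bins, r)
```

The counts per interval are what a histogram displays as bar heights, and the sign of r is what a scatter plot shows as an upward or downward drift of the points.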


4. Visual Perception and Data Visualization 


A human can distinguish differences in line length, shape, orientation, and color (hue) 
readily without significant processing effort; these are referred to as “pre-attentive attributes.”
For example, it may require significant time and effort (attentive processing) to identify the 
number of times the digit “5” appears in a series of numbers; but if that digit is different in 
size, orientation, or color, instances of the digit can be noted quickly through pre-attentive 
processing. 


Effective graphics take advantage of pre-attentive processing and attributes and the 
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telative strength of these attributes. For example, since humans can more easily process 
differences in line length than surface area, it may be more effective to use a bar chart (which 
takes advantage of line length to show comparison) rather than pie charts (which use surface 
area to show comparison). 
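
The point about line length can be illustrated with a very small text-based bar chart. This is only a sketch: the function `text_bar_chart` and the market-share figures are hypothetical, and a real project would normally use a charting library.

```python
def text_bar_chart(values, width=40, mark="#"):
    """Render {label: value} as horizontal bars whose lengths are proportional to the values."""
    if not values:
        return ""
    largest = max(values.values())
    if largest <= 0:
        raise ValueError("at least one value must be positive")
    label_width = max(len(label) for label in values)
    lines = []
    for label, value in values.items():
        # The bar length encodes the value, so values are compared by line length.
        bar = mark * round(width * value / largest)
        lines.append(f"{label.ljust(label_width)} | {bar} {value}")
    return "\n".join(lines)


# Hypothetical market shares (percent).
shares = {"Product A": 40, "Product B": 30, "Product C": 20, "Product D": 10}
print(text_bar_chart(shares))
```

Because every bar starts at the same position, the reader compares the values simply by comparing where the bars end.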

Almost all data visualizations are created for human consumption. Knowledge of human 
perception and cognition is necessary when designing intuitive visualizations. Cognition 
refers to processes in human beings like perception, attention, learning, memory, thought, 
concept formation, reading, and problem solving. Human visual processing is efficient in 
detecting changes and making comparisons between quantities, sizes, shapes and variations in 
lightness. When properties of symbolic data are mapped to visual properties, humans can 
browse through large amounts of data efficiently. It is estimated that 2/3 of the brain’s 
neurons can be involved in visual processing. Proper visualization provides a different 
approach to show potential connections, relationships, etc. which are not as obvious in 
non-visualized quantitative data. Visualization can become a means of data exploration. 


5. Terminology 


Data visualization involves specific terminology, some of which is derived from 
statistics. For example, author Stephen Few defines two types of data, which are used in 
combination to support a meaningful analysis or visualization: 

* Categorical: Text labels describing the nature of the data, such as “Name” or “Age”.
This term also covers qualitative (nonnumerical) data.

* Quantitative: Numerical measures, such as “25” to represent the age in years.

Two primary types of information displays are tables and graphs. 

A table contains quantitative data organized into rows and columns with categorical 
labels. It is primarily used to look up specific values. In the example above, the table might 
have categorical column labels representing the name (a qualitative variable) and age (a 
quantitative variable), with each row of data representing one person (the sampled 
experimental unit or category subdivision). 

A graph is primarily used to show relationships among data and portrays values encoded 
as visual objects (e.g., lines, bars, or points). Numerical values are displayed within an area 
delineated by one or more axes. These axes provide scales (quantitative and categorical) used 
to label and assign values to the visual objects. Many graphs are also referred to as charts. 
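
The distinction between categorical and quantitative data, and the use of a table to look up specific values, can be sketched in a few lines. The sample records and the helper names `classify_columns` and `lookup` are assumptions made for illustration; they do not come from Stephen Few's work.

```python
def classify_columns(rows):
    """Label each column 'quantitative' if every value is a number, else 'categorical'."""
    kinds = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
        )
        kinds[column] = "quantitative" if numeric else "categorical"
    return kinds


def lookup(rows, key_column, key, value_column):
    """Look up one specific value, which is the main use of a table."""
    for row in rows:
        if row[key_column] == key:
            return row[value_column]
    raise KeyError(key)


# Hypothetical sample: each row represents one person.
people = [
    {"Name": "Alice", "Age": 25},
    {"Name": "Bob", "Age": 31},
    {"Name": "Chen", "Age": 28},
]

print(classify_columns(people))
print(lookup(people, "Name", "Bob", "Age"))
```

Here the column labels are categorical, the names are qualitative values, and the ages are quantitative values that could be placed on the quantitative axis of a graph.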


6. Data Presentation Architecture 


Data presentation architecture (DPA) is a skill-set that seeks to identify, locate,
manipulate, format and present data in such a way as to optimally communicate meaning and
proper knowledge.

Historically, the term data presentation architecture is attributed to Kelly Lautt. Data 
Presentation Architecture (DPA) is a rarely applied skill set critical for the success and value 
of Business Intelligence. DPA is neither an IT nor a business skill set but exists as a separate 
field of expertise. Often confused with data visualization, data presentation architecture is a 
much broader skill set that includes determining what data on what schedule and in what 
exact format is to be presented, not just the best way to present data that has already been 
chosen. Data visualization skills are one element of DPA. 


6.1 Objectives 


DPA has two main objectives: 

* To use data to provide knowledge in the most efficient manner possible (minimize 
noise, complexity, and unnecessary data or detail given each audience’s needs and 
roles). 

* To use data to provide knowledge in the most effective manner possible (provide
relevant, timely and complete data to each audience member in a clear and 
understandable manner that conveys important meaning, is actionable and can affect 
understanding, behavior and decisions). 


6.2 Scope 


With the above objectives in mind, the actual work of data presentation architecture 
consists of: 

* Creating effective delivery mechanisms for each audience member depending on their 
role, tasks, locations and access to technology 

* Defining important meaning (relevant knowledge) that is needed by each audience 
member in each context 

* Determining the required periodicity of data updates (the currency of the data) 

* Determining the right timing for data presentation (when and how often the user needs 
to see the data) 

* Finding the right data (subject area, historical reach, breadth, level of detail, etc.) 

* Utilizing appropriate analysis, grouping, visualization, and other presentation formats


6.3 Related fields


DPA work shares commonalities with several other fields, including:

* Business analysis in determining business goals, collecting requirements, mapping
processes.

* Business process improvement in that its goal is to improve and streamline actions and
decisions in furtherance of business goals.

* Data visualization in that it uses well-established theories of visualization to add or
highlight meaning or importance in data presentation.

* Graphic or user design: As the term DPA is used, it falls just short of design in that it
does not consider such detail as colour palettes, styling, branding and other aesthetic
concerns, unless these design elements are specifically required or beneficial for
communication of meaning, impact, severity or other information of business value.
For example:

(1) choosing locations for various data presentation elements on a presentation
page (such as in a company portal, in a report or on a web page) in order to convey
hierarchy, priority, importance or a rational progression for the user is part of the DPA
skill-set;

(2) choosing to provide a specific colour in graphical elements that represent data
of specific meaning or concern is part of the DPA skill-set.

* Information architecture, but information architecture's focus is on unstructured data
and therefore excludes both analysis (in the statistical/data sense) and direct
transformation of the actual content (data, for DPA) into new entities and combinations.


New Words


discipline  ['disiplin]  n. academic discipline, field of study; v. to train
variable  ['vɛəriəbl]  n. variable, variable quantity; adj. variable, changeable
plot  [plɔt]  n. plot, diagram; v. to plot, to draw (a graph)
dot  [dɔt]  n. dot, point; vt. to mark with dots
evidence  ['evidəns]  n. evidence, proof, sign
usable  ['ju:zəbl]  adj. usable, convenient to use
causality  [kɔ:'zæliti]  n. causality, cause-and-effect relationship
measurement  ['meʒəmənt]  n. measurement; measured size; system of measurement
chart  [tʃɑ:t]  n. chart; vt. to chart, to draw a chart of
sensor  ['sensə]  n. sensor
aesthetic  [i:s'θetik]  adj. aesthetic, relating to beauty
sparse  [spɑ:s]  adj. sparse, thinly scattered
attention  [ə'tenʃən]  n. attention, notice
millennium  [mi'leniəm]  n. millennium, a thousand years
precision  [pri'siʒən]  n. precision, accuracy
distort  [dis'tɔ:t]  vt. to distort, to misrepresent (facts, etc.)
coherent  [kəu'hiərənt]  adj. coherent, consistent
reveal  [ri'vi:l]  vt. to reveal, to show, to display
tabulation  [,tæbju'leiʃən]  n. tabulation, table
decoration  [,dekə'reiʃən]  n. decoration; ornament
verbal  ['və:bəl]  adj. verbal, spoken
precise  [pri'sais]  adj. precise, exact
misleading  [mis'li:diŋ]  adj. misleading, easily misunderstood
chartjunk  ['tʃɑ:tdʒʌŋk]  n. chartjunk, useless chart decoration
extraneous  [eks'treinjəs]  adj. extraneous, irrelevant
interior  [in'tiəriə]  adj. interior, inner; n. interior, inside
gratuitous  [grə'tju:itəs]  adj. gratuitous, unnecessary, unjustified
explanatory  [iks'plænətəri]  adj. explanatory
debris  ['debri:]  n. debris, fragments
unemployment  [,ʌnim'plɔimənt]  n. unemployment; number of unemployed people
deviation  [,di:vi'eiʃən]  n. deviation, departure
interval  ['intəvəl]  n. interval, gap; time interval
histogram  ['histəugræm]  n. histogram
quartile  ['kwɔ:tail]  n. quartile
outlier  ['autlaiə]  n. outlier, anomalous value
inflation  [in'fleiʃən]  n. inflation; sharp rise in prices
cartogram  ['kɑ:təgræm]  n. cartogram, statistical map
perception  [pə'sepʃən]  n. perception, understanding, awareness
hue  [hju:]  n. hue, colour, tint
cognition  [kɔg'niʃən]  n. cognition
neuron  ['njuərɔn]  n. neuron, nerve cell
combination  [,kɔmbi'neiʃən]  n. combination, union
categorical  [,kæti'gɔrikəl]  adj. categorical, by category; unconditional, absolute
qualitative  ['kwɔlitətiv]  adj. qualitative
nonnumerical  [nɔnnju'merikəl]  adj. nonnumerical
quantitative  ['kwɔntitətiv]  adj. quantitative; relating to quantity
subdivision  ['sʌbdi,viʒən]  n. subdivision; a part
portray  [pɔ:'trei]  vt. to portray, to depict
delineate  [di'linieit]  vt. to delineate, to outline
axes  ['æksi:z]  n. axes (plural of axis)
scale  [skeil]  n. scale, graduation, measure, proportion, range of values, level
intelligence  [in'telidʒəns]  n. intelligence; information
expertise  [,ekspə'ti:z]  n. expertise, expert opinion, specialized skill
confused  [kən'fju:zd]  adj. confused, puzzled
exact  [ig'zækt]  adj. exact, precise
complexity  [kəm'pleksiti]  n. complexity; something complex
unnecessary  [ʌn'nesisəri]  adj. unnecessary, superfluous
role  [rəul]  n. role, function
convey  [kən'vei]  vt. to convey, to communicate; to transfer
periodicity  [,piəriə'disiti]  n. periodicity
grouping  ['gru:piŋ]  n. grouping, classification
improvement  [im'pru:vmənt]  n. improvement, progress
streamline  ['stri:mlain]  vt. to streamline, to simplify; to modernize
severity  [si'veriti]  n. severity, seriousness, strictness
portal  ['pɔ:təl]  n. portal, gateway
hierarchy  ['haiərɑ:ki]  n. hierarchy, levels
priority  [prai'ɔriti]  n. priority, precedence
rational  ['ræʃənl]  adj. rational, reasonable
exclude  [iks'klu:d]  vt. to exclude, to shut out


Phrases


data visualization  presenting data in graphical form
visual communication  conveying ideas by visual means
visual representation  depiction in a visual, intuitive form
statistical graphics  statistical charts; the study of statistical graphs
design principle  principle of design
descriptive statistics  statistics that summarize and describe data
grounded theory  theory developed from collected data
Internet of things  network of connected physical devices
data analysis  analysis of data
data science  the science of data
hand in hand  together; in close cooperation
data set  collection of data
fail to  to be unable to, not to succeed in
broad overview  high-level view, overview
fine structure  detailed structure
integrate with ...  to combine with ...
Congressional Budget Office  the budget office of the U.S. Congress
line chart  line graph
categorical subdivision  subdivision by category
ascending order  order from smallest to largest
descending order  order from largest to smallest
bar chart  chart that compares values by the length of bars
pie chart  circular chart divided into sectors
market share  portion of a market held by a product or company
frequency distribution  distribution of frequencies
opposite direction  reverse direction
scatter plot  plot of individual data points
trial and error  repeated experimentation
pre-attentive attribute  attribute perceived before conscious attention
be involved in  to take part in; to be absorbed in
be derived from  to originate from
text label  textual label or tag
numerical measure  measure expressed as a number
visual object  object that is seen
Periodic Table of Visualization Methods  a table that organizes visualization methods
business intelligence  analysis of business data to support decisions
business analysis  analysis of business needs and processes
colour palette  set of colours available for use
web page  page on a website
Abbreviations


DPA (Data Presentation Architecture)
IT (Information Technology)


Notes


[1] Tables are generally used when users will look up a specific measurement, while charts of
various types are used to show patterns or relationships in the data for one or more
variables.
In this sentence, “while” is a conjunction that joins two coordinate clauses and expresses
contrast. “when users will look up a specific measurement” is an adverbial clause of time
that modifies the predicate “are generally used”. “to show patterns or relationships in the
data for one or more variables” is an infinitive phrase that serves as an adverbial of purpose.

[2] To convey ideas effectively, both aesthetic form and functionality need to go hand in hand,
providing insights into a rather sparse and complex data set by communicating its
key-aspects in a more intuitive way.
In this sentence, “To convey ideas effectively” is an infinitive phrase that serves as an
adverbial of purpose. “providing insights into a rather sparse and complex data set by
communicating its key-aspects in a more intuitive way” is a present participle phrase that
serves as an adverbial of result.

[3] Not applying these principles may result in misleading graphs, which distort the message
or support an erroneous conclusion.
In this sentence, “Not applying these principles” is a gerund phrase that serves as the
subject. “which distort the message or support an erroneous conclusion” is a non-restrictive
attributive clause that adds information about the object “misleading graphs”. “result in”
means “lead to”.

[4] For example, since humans can more easily process differences in line length than surface
area, it may be more effective to use a bar chart (which takes advantage of line length to
show comparison) rather than pie charts (which use surface area to show comparison).
In this sentence, “since humans can more easily process differences in line length than
surface area” is an adverbial clause of reason. “(which takes advantage of line length to
show comparison)” is an attributive clause that modifies “a bar chart”. “(which use surface
area to show comparison)” is also an attributive clause, modifying “pie charts”.

[5] Often confused with data visualization, data presentation architecture is a much broader
skill set that includes determining what data on what schedule and in what exact format is
to be presented, not just the best way to present data that has already been chosen.
In this sentence, “that includes determining what data on what schedule and in what exact
format is to be presented” is an attributive clause that modifies “a much broader skill set”.
“that has already been chosen” is also an attributive clause, modifying “data”.
Exercises


[Ex. 1] Answer the following questions according to the text.

1. What is data visualization viewed by many disciplines as?
2. What is a primary goal of data visualization?
3. What does data visualization refer to?
4. What are the several best practices for graphical displays the Congressional Budget Office
summarized in a June 2014 presentation?
5. What are the eight types of quantitative messages author Stephen Few describes that users
may attempt to understand or communicate from a set of data and the associated graphs
used to help communicate the message?
6. What does cognition refer to?
7. What is efficient in detecting changes and making comparisons between quantities, sizes,
shapes and variations in lightness?
8. What are the two primary types of information displays mentioned in the passage? What
are they primarily used for, respectively?
9. What is data presentation architecture (DPA)?
10. How many main objectives does DPA have? What are they?


[Ex. 2] Translate the following sentences into Chinese.

1. Each input parameter should have the variable name and its value.
2. If a computer user fails to log off, the system is accessible to all.
3. User experience designers are great at making software friendly and usable for new
customers.
4. There are no previous statistics for comparison.
5. On modern hardware and operating systems, it can deliver accuracy and precision in the
microsecond range.
6. A histogram is used to graphically summarize and display the distribution of a process data
set.
7. You will need hardware, software, and network expertise.
8. This considerably reduces the debugging time and complexity.
9. In addition, unnecessary processing time and resources are being consumed starting and
stopping the transaction.
10. This may have been an improvement, but “breakthrough” was an overstatement.


[Ex. 3] Translate the following passage.

7 Important Types of Big Data


Big data is a term thrown around in a lot of articles, and for those who understand what 
big data means that is fine, but for those struggling to understand exactly what big data is, it 
can get frustrating. There are several definitions of big data as it is frequently used as an 
all-encompassing term for everything from actual data sets to big data technology and big 
data analytics. However, this article will focus on the actual types of data that are contributing 
to the ever growing collection of data referred to as big data. Specifically we focus on the data 
created outside of an organization, which can be grouped into two broad categories: structured 
and unstructured. 


1. Structured Data 


1.1 Created


Created data is just that; data businesses purposely create, generally for market research. 
This may consist of customer surveys or focus groups. It also includes more modern methods 
of research, such as creating a loyalty program that collects consumer information or asking 
users to create an account and login while they are shopping online. 


1.2 Provoked 


A Forbes article defined provoked data as “giving people the opportunity to express
their views.” Every time a customer rates a restaurant, an employee, a purchasing experience
or a product, they are creating provoked data. Rating sites, such as Yelp, also generate this
type of data.


1.3 Transacted 


Transactional data is also fairly self-explanatory. Businesses collect data on every 
transaction completed, whether the purchase is completed through an online shopping cart or 
in-store at the cash register. Businesses also collect data on the steps that lead to a purchase 
online. For example, a customer may click on a banner ad that leads them to the product pages 
which then spurs a purchase. 

As explained by the Forbes article, “Transacted data is a powerful way to understand
exactly what was bought, where it was bought, and when. Matching this type of data with
other information, such as weather, can yield even more insights.”


1.4 Compiled


Compiled data consists of giant databases of data collected on every U.S. household. Companies
like Acxiom collect information on things like credit scores, location, demographics,
purchases and registered cars that marketing companies can then access for supplemental
consumer data.


1.5 Experimental


Experimental data is created when businesses experiment with different marketing pieces 
and messages to see which are most effective with consumers. You can also look at 
experimental data as a combination of created and transactional data. 


2. Unstructured Data 


People in the business world are generally very familiar with the types of structured data 
mentioned above. However, unstructured data is a little less familiar, not because there’s less of it,
but because, before technologies like NoSQL and Hadoop came along, harnessing unstructured data
wasn’t possible. In fact, most data being created today is unstructured. Unstructured data, as
the name suggests, lacks structure. It can’t be gathered based on clicks, purchases or a 
barcode, so what is it exactly? 


2.1 Captured 


Captured data is created passively due to a person’s behavior. Every time someone enters
a search term on Google, that is data that can be captured for future benefit. The GPS info on
our smartphones is another example of passive data that can be captured with big data 
technologies. 


2.2 User-generated 


User-generated data consists of all of the data individuals are putting on the Internet 
every day. From tweets, to Facebook posts, to comments on news stories, to videos put up on 
YouTube, individuals are creating a huge amount of data that businesses can use to better 
target consumers and get feedback on products. 

Big data is made up of many different types of data. The seven listed above comprise 
types of external data included in the big data spectrum. There are, of course, many types of 
internal data that contribute to big data as well, but hopefully breaking down the types of data 
helps you to better see why combining all of this data into big data is so powerful for 
business. 
[Ex. 4] Fill in the blanks with the words given below (use each word only once).


track    sophisticated    correlations    databases    insights
visualization    algorithms    accumulated    manipulate    environments


Data Visualization 


Data visualization is a general term that describes any effort to help people understand 
the significance of data by placing it in a visual context. Patterns, trends and correlations that 
might go undetected in text-based data can be exposed and recognized more easily with data
visualization software. 

Today’s data visualization tools go beyond the standard charts and graphs used in
Microsoft Excel spreadsheets, displaying data in more __(1)__ ways such as infographics,
dials and gauges, geographic maps, sparklines, heat maps, and detailed bar, pie and fever
charts. The images may include interactive capabilities, enabling users to __(2)__ them or
drill into the data for querying and analysis. Indicators designed to alert users when data has
been updated or predefined conditions occur can also be included.


1. Importance of data visualization 


Data visualization has become the de facto standard for modern business intelligence (BI).
The success of the two leading vendors in the BI space, Tableau and Qlik, both of which
heavily emphasize __(3)__, has moved other vendors toward a more visual approach in
their software. Virtually all BI software has strong data visualization functionality.

Data visualization tools have been important in democratizing data and analytics and 
making data-driven __(4)__ available to workers throughout an organization. They are
typically easier to operate than traditional statistical analysis software or earlier versions of BI 
software. This has led to a rise in lines of business implementing data visualization tools on 
their own, without support from IT. 

Data visualization software also plays an important role in big data and advanced 
analytics projects. As businesses __(5)__ massive troves of data during the early years of the
big data trend, they needed a way to quickly and easily get an overview of their data. 
Visualization tools were a natural fit. 

Visualization is central to advanced analytics for similar reasons. When a data scientist is
writing advanced predictive analytics or machine learning __(6)__, it becomes important to
visualize the outputs to monitor results and ensure that models are performing as intended.
This is because visualizations of complex algorithms are generally easier to interpret than
numerical outputs.


2. Examples of data visualization


Data visualization tools can be used in a variety of ways. The most common use today is 
as a BI reporting tool. Users can set up visualization tools to generate automatic 
dashboards that track company performance across key performance indicators and visually 
interpret the results. 

Many business departments implement data visualization software to __(7)__ their own
initiatives. For example, a marketing team might implement the software to monitor the
performance of an email campaign, tracking metrics like open rate, click-through
rate and conversion rate.

As data visualization vendors extend the functionality of these tools, they are
increasingly being used as front ends for more sophisticated big data __(8)__. In this setting,
data visualization software helps data engineers and scientists keep track of data sources and 
do basic exploratory analysis of data sets prior to or after more detailed advanced analyses. 


3. How data visualization works 


Most of today's data visualization tools come with connectors to popular data sources, 
including the most common relational __(9)__, Hadoop and a variety of cloud storage 
platforms. The visualization software pulls in data from these sources and applies a graphic 
type to the data. 

Data visualization software allows the user to select the best way of presenting the data, 
but, increasingly, software automates this step. Some tools automatically interpret the shape 
of the data and detect __(10)__ between certain variables and then place these discoveries
into the chart type that the software determines is optimal. 

Typically, data visualization software has a dashboard component that allows users to 
pull multiple visualizations of analyses into a single interface, generally a web portal. 


Text B 


The 14 Best Data Visualization Tools 


Raw data is boring and it’s difficult to make sense of it in its natural form. Add 
visualization to it and you get something that everybody can easily digest. You can not only 
make sense of it faster, but also observe interesting patterns that wouldn’t be apparent from 
looking only at stats. 




To make the tedious task of making beautiful charts and maps easier, I've made a list
of the best data visualization tools available for the job. I've divided the list into two parts: the first
covers the tools that require coding and are meant for developers, while the second
contains data visualization software products that don’t require any coding.

Let’s get started! 


1. For Developers 


1.1 D3.js


D3.js, short for “Data Driven Documents”, is the first name that comes to mind when we
think of data visualization software. It uses HTML, CSS, and SVG to render some
amazing charts and diagrams. If you can imagine any visualization, you can do it with D3. It
is feature-packed, rich in interactivity and extremely beautiful. Most of all, it’s free and
open-source. 

It doesn't ship with pre-built charts out of the box, but has a nice gallery which
showcases what's possible with D3. There are two major concerns with D3.js: it has a steep
learning curve and it is compatible only with modern browsers (IE 9+). So pick it up only
when you have enough time on hand and are not concerned about displaying your charts on
older browsers.


1.2 FusionCharts 


FusionCharts has probably the most exhaustive collection of charts and maps. With
90+ chart types and 965 maps, you'll find everything that you need right out of the box. It not
only supports modern browsers, but also older browsers starting from IE 6. 

FusionCharts supports both JSON and XML data formats, and you can export charts in 
PNG, JPEG, SVG or PDF. They have a nice collection of business dashboards and live 
demos for inspiration. 

Their charts and maps work across all devices and platforms, are highly customizable 
and have beautiful interactions. One thing to keep in mind about FusionCharts is that it's 
slightly expensive. But you can always get started with their unrestricted free trial and then 
buy if you like it. 


1.3 Chart.js


Chart.js is a tiny open source library that supports just six chart types: line, bar, radar, 
polar, pie and doughnut. But the reason I like it is that sometimes that's all the charts one 
needs for a project. If the application is big and complex, then libraries like Google Charts
and FusionCharts make sense; otherwise, for small hobby projects, Chart.js is the perfect
solution.

It uses the HTML5 canvas element for rendering charts. All the charts are responsive and
use flat design. It is one of the most popular open-source charting libraries to emerge recently. 
Check out the documentation for live examples of all six chart types. 
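
Since the charts are described by a small amount of data, the chart description can be prepared on the server side and sent to the browser as JSON. The sketch below, written in Python, builds such a description. It assumes the commonly documented Chart.js layout (a `type` plus `data` with `labels` and `datasets`) and assumes that the polar chart is named `polarArea`; the helper `chart_config` and the sales figures are hypothetical, so check the Chart.js documentation before relying on the exact field names.

```python
import json

# The six chart types named in the text, mapped to the type strings that
# Chart.js is assumed to expect ("polarArea" for the polar chart).
CHART_TYPES = {
    "line": "line",
    "bar": "bar",
    "radar": "radar",
    "polar": "polarArea",
    "pie": "pie",
    "doughnut": "doughnut",
}


def chart_config(kind, labels, values, series_name="Series 1"):
    """Build a dictionary describing one chart with a single data series."""
    if kind not in CHART_TYPES:
        raise ValueError(f"unsupported chart type: {kind}")
    if len(labels) != len(values):
        raise ValueError("labels and values must have the same length")
    return {
        "type": CHART_TYPES[kind],
        "data": {
            "labels": list(labels),
            "datasets": [{"label": series_name, "data": list(values)}],
        },
    }


# Hypothetical quarterly sales figures.
config = chart_config("bar", ["Q1", "Q2", "Q3"], [12, 19, 7], "Sales")
print(json.dumps(config, indent=2))
```

The printed JSON text is what a web page would pass to the charting library; the Python code itself does not draw anything.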


1.4 Google Charts 


Google Charts renders charts in HTML5/SVG to provide cross-browser compatibility
and cross-platform portability to iPhones and Android. It also includes VML for supporting 
older IE versions. 

It offers a decent number of charts which cover the most commonly used chart types
like bar, area, pie and gauges. It is flexible and user friendly (because Google!). You can 
view this gallery to get an idea of various charts and the type of interactions to expect. 


1.5 Highcharts 


Highcharts is another big player in the charting space. Like FusionCharts, it also offers a 
diverse range of charts and maps right out of the box. Other than normal charts, it also offers a 
different package for stock charts called Highstock which is also feature rich. 

It allows exporting charts in PNG, JPG, SVG and PDF. You can view the various chart 
types it offers in the demo section. Highcharts is free for non-commercial and personal use, 
but you will have to buy a license for deploying it in commercial applications. 


1.6 Leaflet 


Leaflet is an open-source library developed by Vladimir Agafonkin for mobile-friendly 
interactive maps. It is extremely light (at just 33kb) and has lots of features for making any 
kind of maps. It uses HTML5 and CSS3 for rendering maps, and works across all major
desktop and mobile platforms. In the words of Vladimir Agafonkin, Leaflet is designed with 
simplicity, performance and usability in mind. 

There is a wide range of plugins available for adding features like animated markers, 
heatmaps etc. that extend the core functionality. If you are thinking of developing an 
application that involves maps, you should give Leaflet a try. 


1.7 Dygraphs 


Dygraphs is an open-source JavaScript charting library for handling huge data sets. It's 
fast, flexible and highly customizable. It works in all major browsers (including IE8) and has 
an active community.

Dygraphs has defined a niche use case for itself and won’t be the perfect solution for all
your needs. But it will work for you more often than not whenever you are handling large 
datasets. To explore what is possible, check out this nicely designed demo gallery. 


2. For Non-Developers


2.1 Datawrapper 


Datawrapper is an online tool for making interactive charts. Once you upload the data
from a CSV file or paste it directly into the field, Datawrapper will generate a bar, line or any
other related visualization. Many reporters and news organizations use Datawrapper to embed 
live charts into their articles. It is very easy to use and produces effective graphics. 
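
Tools of this kind take their input as CSV text, so it helps to know how such text is produced. The following is a minimal sketch using Python's standard `csv` module; the column names and sales figures are made up, and the function `to_csv_text` is a hypothetical helper rather than part of any charting tool.

```python
import csv
import io


def to_csv_text(header, rows):
    """Return CSV text with one header line followed by one line per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sales data; the second label contains a comma, so it will be quoted.
rows = [("Product A", 120), ("Product B, large", 95)]
text = to_csv_text(["Product", "Units sold"], rows)
print(text)
```

The resulting text can be saved to a file and uploaded, or pasted directly into the input field of an online chart tool. Using the `csv` module instead of joining strings by hand ensures that values containing commas are quoted correctly.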


2.2 Tableau 


Tableau Public is perhaps the most popular visualization tool which supports a wide 
variety of charts, graphs, maps and other graphics. It is a completely free tool and the charts 
you make with it can be easily embedded in any web page. They have a nice gallery which 
displays visualizations created via Tableau. 

Although it offers charts and graphics that are much better than other similar tools, I 
don’t ‘love’ to use its free version because of the big footer it comes with. If it’s not as big a 
turn-off for you as it is for me, then you should definitely give it a try. Or if you can afford it, 
you can go for a paid version. 


2.3 Raw 


Raw defines itself as the missing link between spreadsheets and vector graphics. It is 
built on top of D3.js and is extremely well designed. It has such an intuitive interface that 
you'll feel like you've used it before. It is open-source and doesn’t require any registration. 

It has a library of 16 chart types to choose from and all the processing is done in the browser,
so your data is safe. Raw is highly customizable and extensible, and can even accept new
custom layouts. 


2.4 Timeline JS


As the name suggests, Timeline JS helps you create beautiful timelines without writing 
any code. It is a free, open-source tool which is used by some of the most popular websites 
like Time and Radiolab. 

Creating your timeline is an easy-to-follow four-step process, which is explained here.
Best part? It can pull in media from a variety of sources and has built-in support for Twitter,
Flickr, Google Maps, YouTube, Vimeo, Vine, Dailymotion, Wikipedia, SoundCloud and
other similar sites.


2.5 Infogram 


Infogram enables you to create both charts and infographics online. It has a restricted 
free version and two paid options which include features like 200+ maps, private sharing and 
icons library etc. 

It comes with an easy-to-use interface and its basic charts are well designed. One feature 
that I don’t like is the huge logo that you get when you try to embed interactive charts into 
your webpage (in the free version). It would be better if they made it like the little text that
Datawrapper uses. 


2.6 Plotly 


Plotly is a web-based data analysis and graphing tool. It supports a good collection of 
chart types with built-in social sharing features. The charts and graph types available have a
professional look and feel. Creating a chart is just a matter of loading in your information and 
customizing the layout, axes, notes and legend. If you are looking to get started, you can find 
some inspiration here. 


2.7 ChartBlocks 


ChartBlocks is another online chart builder that is well designed and allows you to build 
basic charts very quickly. It has a limited number of chart types, but that will not be a problem 
as most common chart types are covered. 

It allows you to pull in data from multiple external sources, such as spreadsheets and 
databases. After you have made the chart, you can export it as SVG or PNG, embed it 
in your website or share it on social media. 


New Words

digest [daɪˈdʒest] vt. to digest, understand; to absorb thoroughly; to classify; to sort out
    [ˈdaɪdʒest] n. classification; summary
stats [stæts] n. statistics; statistical tables (short for statistics)
tedious [ˈtiːdiəs] adj. dull, tiresome, boring
diagram [ˈdaɪəɡræm] n. diagram, chart
gallery [ˈɡæləri] n. gallery, image library
showcase [ˈʃəʊkeɪs] n. (glass) display case in a shop or museum
exhaustive [ɪɡˈzɔːstɪv] adj. complete, thorough, detailed
dashboard [ˈdæʃbɔːd] n. dashboard, instrument panel
demo [ˈdeməʊ] n. demonstration
inspiration [ˌɪnspəˈreɪʃən] n. inspiration
customizable [ˈkʌstəmaɪzəbl] adj. able to be customized or tailored to the user
doughnut [ˈdəʊnʌt] n. doughnut (ring-shaped cake)
hobby [ˈhɒbi] n. hobby, pastime
responsive [rɪˈspɒnsɪv] adj. responsive
compatibility [kəmˌpætəˈbɪləti] n. compatibility
gauge [ɡeɪdʒ] n. standard measure, gauge, meter
    v. to measure
usability [ˌjuːzəˈbɪləti] n. usability
heatmap [ˈhiːtmæp] n. heat map
intuitive [ɪnˈtjuːɪtɪv] adj. intuitive
registration [ˌredʒɪˈstreɪʃən] n. registration, enrolment
extensible [ɪkˈstensəbl] adj. extensible, extendable
timeline [ˈtaɪmlaɪn] n. timeline
restricted [rɪˈstrɪktɪd] adj. restricted, limited


Phrases

make sense of ...          to work out the meaning of ...
Data Driven Documents      documents driven by data
steep learning curve       a skill that is hard to pick up at first
be compatible with         to be able to work together with
pick up on ...             to notice ...
be concerned about         to be worried about
free trial                 a period of use at no charge
canvas element             the HTML canvas element
rendering chart            drawing a chart
flat design                flat (minimalist) design
stock chart                a chart of share prices
animated marker            animation maker, animation program
use case                   use case
more often than not        usually, in most cases
web page                   web page
turn-off                   something off-putting or disappointing
a variety of               many kinds of
social sharing             sharing on social networks
limited number             a small number
social media               social media

Abbreviations

CSS (Cascading Style Sheets)
SVG (Scalable Vector Graphics)
JSON (JavaScript Object Notation)
XML (eXtensible Markup Language)
PNG (Portable Network Graphics)
JPEG (Joint Photographic Experts Group)
PDF (Portable Document Format)
CSV (Comma-Separated Values; also read as character-separated values)


Exercises


[Ex. 5] Fill in the blanks according to the text.

1. D3.js is short for ______. It is the first name that comes to mind when we think of 
a data visualization software. It uses ______, ______ and ______ to render 
some amazing charts and diagrams. 

2. FusionCharts supports both ______ and ______ data formats, and you can export 
charts in PNG, JPEG, SVG or PDF. 

3. Chart.js is a tiny ______ that supports just six chart types: ______, ______, 
radar, polar, ______ and doughnut. 

4. Google Charts offers a decent number of charts which covers the most commonly used 
chart types like ______, ______ and ______. 

5. Dygraphs is an open-source ______ for handling huge data sets. It’s fast, 
______ and highly ______. It works in all major browsers (including IE8) 
and has ______. 

6. Datawrapper is an online tool for making ______. Once you upload the data from 
a CSV file or paste it directly into ______, Datawrapper will generate a bar, line or 
any other ______. 

7. Tableau Public is perhaps the most popular visualization tool which supports a wide variety 
of ______, ______, ______ and other graphics. 

8. Raw defines itself as the missing link between ______ and ______. It is built 
on top of D3.js and is extremely well designed. It has a library of ______ to choose 
from and all the processing is done in ______. 

9. Infogram enables you to create both charts and infographics ______. It has a 
______ and two paid options which include features like ______, ______ 
and icons library etc. 

10. Plotly is a ______ data analysis and graphing tool. It supports a good collection of 
chart types with ______. 
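Reference Translation

Data Visualization


Many disciplines regard data visualization as a modern form of visual communication. It involves creating and studying the visual representation of data, meaning "information that has been abstracted in some schematic form, including attributes or variables for the units of information".

The primary goal of data visualization is to communicate information clearly and efficiently through statistical graphics, plots and information graphics. Numerical data may be encoded using dots, lines or bars to communicate a quantitative message visually. Effective visualization helps users analyze and reason about data and evidence. It makes complex data more accessible, understandable and usable. Users may have particular analytical tasks, such as making comparisons or understanding causality, and the design principle of the graphic (i.e., showing comparisons or showing causality) follows the task. Tables are generally used where users look up a specific measurement, while charts of various types are used to show patterns in the data and relationships among one or more variables.

Data visualization is both an art and a science. Some view it as a branch of descriptive statistics, others as a grounded theory development tool. Internet activity and a growing number of sensors in the environment create ever more data. Processing, analyzing and communicating this data present ethical and analytical challenges for data visualization. Data scientists help address this challenge.


1. Overview


Data visualization refers to the techniques used to communicate data or information by encoding it as visual objects (e.g., points, lines or bars) contained in graphics. The goal is to communicate information clearly and efficiently to users. It is one of the steps in data analysis or data science. According to Friedman, the main goal of data visualization is to communicate information clearly and effectively through graphical means. This does not mean that data visualization needs to look boring to be functional or extremely sophisticated to look beautiful. To convey ideas effectively, aesthetic form and functionality need to go hand in hand, providing insight into a rather sparse and complex data set by communicating its key aspects in a more intuitive way. Yet designers often fail to achieve a balance between form and function, creating gorgeous data visualizations that fail to serve their main purpose: to communicate information.

Indeed, Fernanda Viegas and Martin M. Wattenberg suggest that an ideal visualization should not only communicate clearly but also stimulate viewer engagement and attention.

Data visualization is closely related to information graphics, information visualization, scientific visualization, exploratory data analysis and statistical graphics. In the new millennium, data visualization has become an active area of research, teaching and development.

Figure 11-1: Data visualization is one of the steps in analyzing data and presenting it to users (figure omitted).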


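2. Characteristics of Effective Graphical Displays


Professor Edward Tufte explained that users of information displays are executing particular analytical tasks, such as making comparisons or determining causality. The design principle of the information graphic should support the analytical task, showing the comparison or causality.

In his book The Visual Display of Quantitative Information, Edward Tufte defines "graphical displays" and principles for effective graphical display. In his view, excellence in statistical graphics consists of complex ideas communicated with clarity, precision and efficiency. Graphical displays should:

* show the data;

* induce the viewer to think about the substance rather than about methodology, graphic design, the technology of graphic production or something else;

* avoid distorting what the data have to say;

* present many numbers in a small space;

* make large data sets coherent;

* encourage the eye to compare different pieces of data;

* reveal the data at several levels of detail, from a broad overview to the fine structure;

* serve a reasonably clear purpose: description, exploration, tabulation or decoration;

* be closely integrated with the statistical and verbal descriptions of a data set.

Graphics can be more precise and revealing than conventional statistical computations.

Not applying these principles may result in misleading graphs, which distort the message or support an erroneous conclusion. Do not separate the explanatory key from the image itself, since that forces the eye to travel back and forth between the image and the key.

The Congressional Budget Office summarized several best practices for graphical displays in a June 2014 presentation. These included:

* knowing your audience;

* designing graphics that can stand alone outside the context of the report;

* designing graphics that communicate the key messages in the report.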


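3. Quantitative Messages


Author Stephen Few described eight types of quantitative messages that users may attempt to understand or communicate from a set of data, together with the graphs used to help communicate each message.

* Time series: a single variable is captured over a period of time, such as the unemployment rate over a 10-year period. A line chart may be used to demonstrate the trend.

* Ranking: categorical subdivisions are ranked in ascending or descending order, such as a ranking of sales performance (the measure) by salesperson (the category, with each salesperson a categorical subdivision) during a single period. A bar chart may be used to show the comparison across the salespersons.

* Part-to-whole: categorical subdivisions are measured as a ratio to the whole (i.e., a percentage out of 100%). A pie chart or bar chart can show the comparison of ratios, such as the market share represented by competitors in a market.

* Deviation: categorical subdivisions are compared against a reference, such as a comparison of actual versus budgeted expenses for several departments of a business over a given time period. A bar chart can show the comparison of the actual amount with the reference amount.

* Frequency distribution: shows the number of observations of a particular variable for a given interval, such as the number of years in which the stock market return falls within intervals such as 0-10%, 11%-20% and so on. A histogram, a type of bar chart, may be used for this analysis. A boxplot helps show key statistics about the distribution, such as the median, quartiles and outliers.

* Correlation: compares observations represented by two variables (X, Y) to determine whether they tend to move in the same or opposite directions, for example by plotting unemployment (X) against inflation (Y) for a sample of months. A scatter plot is typically used for this message.

* Nominal comparison: compares categorical subdivisions in no particular order, such as sales volume by product code. A bar chart may be used for this comparison.

* Geographic or geospatial: compares a variable across a map or layout, such as the unemployment rate by country or the number of people on the various floors of a building. A cartogram is the typical graphic used.

Analysts reviewing a set of data may consider whether some or all of the messages and graphic types above are applicable to their task and audience. Exploratory data analysis is the process of identifying meaningful relationships and messages in the data.

Figure 11-2: A time series illustrated with a line chart, showing trends in U.S. federal spending and revenue over time (figure omitted).

Figure 11-3: A scatter plot illustrating the negative correlation between two variables (inflation and unemployment) measured at points in time (figure omitted).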

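To make the frequency-distribution message concrete, the short sketch below bins a handful of annual returns into 10-point intervals (the counts a histogram would draw) and computes the median and quartiles that a boxplot summarizes. The return figures and the helper name `bin_counts` are invented for illustration and do not come from the text; the bins are half-open intervals, which differs slightly from the 0-10%, 11%-20% labelling used above.

```python
from collections import Counter
from statistics import median, quantiles

# Hypothetical annual stock market returns in percent (illustrative only).
returns = [3, 7, 12, 15, 18, 22, 25, 8, 14, 31]


def bin_counts(values, width=10):
    """Count how many values fall into each half-open interval [lo, lo + width)."""
    counts = Counter((v // width) * width for v in values)
    return {f"{lo}-{lo + width}%": counts[lo] for lo in sorted(counts)}


histogram = bin_counts(returns)
# Quartile cut points, using the statistics module's default method.
q1, q2, q3 = quantiles(returns, n=4)

# A text histogram: one '#' per observation in each interval.
for label, count in histogram.items():
    print(f"{label:>8} | {'#' * count}")
print("median:", median(returns), "quartiles:", (q1, q2, q3))
```

A plotting library would normally draw the bars and the box; the point here is only that a histogram is a count per interval and a boxplot is a summary of quartiles.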

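4. Visual Perception and Data Visualization


A human can easily distinguish differences in line length, shape, orientation and color (hue) without significant processing effort; these are referred to as "pre-attentive attributes". For example, it may require significant time and effort ("attentive processing") to identify the number of times the digit "5" appears in a series of numbers; but if that digit differs in size, orientation or color, instances of it can be noted quickly through pre-attentive processing.

Effective graphics take advantage of pre-attentive processing and attributes and of the relative strength of these attributes. For example, since humans can more easily process differences in line length than in surface area, it may be more effective to use a bar chart (which uses line length to show comparison) than a pie chart (which uses surface area to show comparison).

Almost all data visualizations are created for human consumption. Knowledge of human perception and cognition is necessary when designing intuitive visualizations. Cognition refers to processes in human beings such as perception, attention, learning, memory, thought, concept formation, reading and problem solving. Human visual processing is efficient at detecting changes and making comparisons between quantities, sizes, shapes and variations in lightness. When properties of symbolic data are mapped to visual properties, humans can browse through large amounts of data efficiently. It is estimated that two-thirds of the brain's neurons can be involved in visual processing. Proper visualization provides a different approach to showing potential connections and relationships that are not obvious in non-visualized quantitative data. Visualization can become a means of data exploration.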


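5. Terminology


Data visualization involves specific terminology, some of which is derived from statistics. For example, author Stephen Few defines two types of data, which are used in combination to support a meaningful analysis or visualization.

* Categorical: text labels describing the nature of the data, such as "name" or "age". This term also covers qualitative (non-numerical) data.

* Quantitative: numerical measures, such as "25" to represent an age.

The two primary types of information display are tables and graphs.

* A table contains quantitative data organized into rows and columns with categorical labels. It is primarily used to look up specific values. In the example above, the table might have categorical column labels representing the name (a qualitative variable) and the age (a quantitative variable), with each row of data representing one person (the sampled experimental unit or categorical subdivision).

* A graph is primarily used to show relationships among data and portrays values encoded as visual objects (e.g., lines or points). Numerical values are displayed within an area delineated by one or more axes. These axes provide scales (quantitative and categorical) used to label and assign values to the visual objects. Many graphs are also referred to as charts.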


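6. Data Presentation Architecture


Data presentation architecture (DPA) is a skill set that seeks to identify, locate, manipulate, format and present data with the proper knowledge, so as to communicate meaning in the best possible way.

Historically, the term data presentation architecture is attributed to Kelly Lautt. Data presentation architecture (DPA) is a rarely applied skill set that is critical to the success and value of business intelligence. DPA is neither an IT nor a business skill set but exists as a separate field of expertise. It is often confused with data visualization, but data presentation architecture is a much broader skill set that includes determining what data to provide, on what schedule and in what exact format, not just the best way to present data that has already been chosen. Data visualization skills are one element of DPA.


6.1 Objectives


DPA has two main objectives:

* To use data to provide knowledge in the most efficient manner possible (minimizing noise, complexity and unnecessary data or detail, given each audience's needs and roles).

* To use data to provide knowledge in the most effective manner possible (providing relevant, timely and complete data to each audience member in a clear and understandable manner that conveys important meaning, is actionable and can affect understanding, behavior and decisions).


6.2 Scope


With the above objectives in mind, the actual work of data presentation architecture consists of:

* Creating effective delivery mechanisms for each audience member according to their role, tasks, location and access to technology.

* Defining the important meaning (relevant knowledge) that each audience member needs in each context.

* Determining the required periodicity of data updates (the currency of the data).

* Determining the right timing for data presentation (when and how often the user needs to see the data).

* Finding the right data (subject area, historical reach, breadth, level of detail, etc.).

* Utilizing appropriate analysis, grouping, visualization and other presentation formats.


6.3 Related Fields


DPA is also relevant to several other fields, including:

* Business analysis, in determining business goals, collecting requirements and mapping processes.

* Business process improvement, in that its goal is to improve and streamline actions and decisions in furtherance of business goals.

* Data visualization, in that it uses well-established theories of visualization to bring out the meaning or importance of the data.

* Graphic or user design: as the term DPA is used, it does not consider design elements such as color differences, styling, branding and other aesthetic details unless they are specifically required, help communication or affect business value. For example:
(1) Choosing locations for the various data presentation elements on a presentation page (such as a company portal, a report or a web page) in order to convey hierarchy, priority, importance or a rational progression for the user is part of the DPA skill set.
(2) Choosing to provide a specific color in graphical elements that represent data of specific meaning or concern is also part of the DPA skill set.

* Information architecture, although information architecture focuses on unstructured data and therefore excludes both analysis (in the statistical/data sense) and the direct transformation of the actual content (data, for DPA) into new entities and combinations.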


Text A 


How to Manage Big Data's Big Security Challenges 


As the amount of data being collected continues to grow, more and more companies are 
building big data repositories to store, aggregate and extract meaning from their data. Big data 
provides an enormous competitive advantage for corporations, helping businesses tailor their 
products to consumer needs, identify and minimize corporate inefficiencies, and share data 
with user groups across the enterprise. With a growth rate of 58 percent in 2017 alone, these 
technologies and their benefits are here to stay. 

Unfortunately, legitimate organizations aren't the only groups that are going big. Large 
sets of consolidated data are a tempting target for cyber attackers. Breaching an organization's 
big data repository can provide criminal groups with bigger payoffs. And when attackers set 
their sights on big data repositories, the effects can be devastating for the affected 
organizations. Terabytes of data in these repositories may include a company's crown 
jewels: customer data, employee data, and trade secrets. The recent data breach at Target 
is estimated to cost the company upwards of $1.1 billion, and the PlayStation breach cost 
Sony an estimated $171 million. A breach in a big data repository could be even more 
damaging at a financial institution or healthcare provider, where the value of the data is 
extremely high and government regulations come into play. 


1. The Data 


The variety, velocity and volume of big data amplify the security management challenges 
that are addressed in traditional security management. Big data repositories will include 
information deposited by various sources across the enterprise. This variety of data makes 
secure access management a challenge. Each data source will have its own access restrictions 
and security policies, making it difficult to balance appropriate security for all data sources 
with the need to aggregate and extract meaning from the data. For example, a big data 
environment may include a dataset with proprietary research information, a dataset requiring 
regulatory compliance, and a separate dataset with personally identifiable information (PII). A 
researcher might want to correlate their research with a dataset including PII data, but what 
restrictions should be in place to ensure adequate security? Protecting big data requires 
balancing analysis like this with security requirements on a case-by-case basis. 

In addition, many of the repositories collect data at high volumes and velocity from a 
number of different data sources, and they all might have their own data transfer workflows. 
These connections to multiple repositories can increase the attack surface for an adversary. A 
big data system receiving feeds from 20 different data sources may present an attacker with 
20 viable vectors to attempt to gain access to a cluster. 


2. The Infrastructure 


Another big data challenge is the distributed nature of big data environments. Compared 
with a single high-end database server, distributed environments are more complicated and 
vulnerable to attack. When big data environments are distributed geographically, physical 
security controls need to be standardized across all accessible locations. When data scientists 
across the organization want access to information, perimeter protection becomes both important 
and complicated: it must give users access while protecting the system from a possible attack. 
With a large number of servers, there is an increased possibility that the configuration of 
servers may not be consistent — and that certain systems may remain vulnerable. 


3. The Technology 


An additional big data security challenge is that big data programming tools, 
including Hadoop and NoSQL databases, were not originally designed with security in mind. 
For example, Hadoop originally didn’t authenticate services or users, and didn’t encrypt data 
that’s transmitted between nodes in the environment. This creates vulnerabilities for 
authentication and network security. NoSQL databases lack some of the security features 
provided by traditional databases, such as role-based access control. The advantage of NoSQL 
is that it allows for the flexibility to include new data types on the fly, but defining security 
policies for this new data is not straightforward with these technologies. 
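
Role-based access control, mentioned above as a feature that traditional databases provide, can be sketched in a few lines. The roles, user names, dataset names and the `is_allowed` helper below are hypothetical and only illustrate the idea; they are not the API of Hadoop, Sentry, Accumulo or any NoSQL product.

```python
# Minimal sketch of role-based access control (RBAC).
# All roles, users and dataset names are hypothetical examples.

# Each role maps to the (dataset, action) pairs it is allowed to perform.
ROLE_PERMISSIONS = {
    "researcher": {("research_data", "read"), ("research_data", "write")},
    "compliance_officer": {("regulated_data", "read")},
    "privacy_analyst": {("pii_data", "read")},
}

# Each user holds zero or more roles; permissions are never granted directly.
USER_ROLES = {
    "alice": {"researcher"},
    "bob": {"researcher", "privacy_analyst"},
}


def is_allowed(user: str, dataset: str, action: str) -> bool:
    """Return True if any role held by the user grants the action on the dataset."""
    roles = USER_ROLES.get(user, set())
    return any((dataset, action) in ROLE_PERMISSIONS.get(role, set()) for role in roles)


print(is_allowed("alice", "pii_data", "read"))  # alice has no role covering PII
print(is_allowed("bob", "pii_data", "read"))    # bob holds the privacy_analyst role
```

Because permissions hang off roles rather than individual users, the researcher in the earlier PII example would gain access only by being given a role that covers the PII dataset, which keeps such decisions explicit and auditable.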


4. Securing Big Data 


So what can be done to help bring the security of traditional database management to big 
data? Several organizations describe and define different security controls. The SANS 
Institute provides a list of 20 security controls. The list contains several controls that I would 
recommend to address the security challenges presented by big data. 

* Application Software Security. Use secure versions of open-source software. As 
described above, big data technologies weren't originally designed with security in 
mind. Using open-source technologies like Apache Accumulo or the 0.20.20x version 
of Hadoop or above can help address this challenge. In addition, proprietary 
technologies like Cloudera Sentry or DataStax Enterprise offer enhanced security at 
the application layer. Specifically, Sentry and Accumulo also support role-based 
access control to enhance security for NoSQL databases. 

* Maintenance, Monitoring, and Analysis of Audit Logs. Implement audit logging 
technologies to understand and monitor big data clusters. Technologies like Apache 
Oozie can help implement this feature. Keep in mind that security engineers in the 
organization need to be tasked with examining and monitoring these files. It's 
important to ensure that auditing, maintaining, and analyzing logs are done 
consistently across the enterprise. 

* Secure Configurations for Hardware and Software. Build servers based on secure 
images for all systems in your organization's big data architecture. Ensure patching is 
up to date on these machines and that administrative privileges are limited to a small 
number of users. Use automation frameworks, like Puppet, to automate system 
configuration and ensure that all big data servers in the enterprise are uniform and 
secure. 

* Account Monitoring and Control. Manage accounts for big data users. Require strong 
passwords, deactivate inactive accounts, and impose a maximum permitted number of 
failed log-in attempts to help stop attackers from gaining access to a cluster. It's 
important to note that the enemy isn't always outside of the organization. Monitoring 
account access can help reduce the probability of a successful compromise from the 
inside. 
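
The account controls in the last item can be sketched as a small policy check. This is a minimal illustration only: the limits of five failed attempts and 90 days of inactivity, and the names `Account`, `record_login` and `deactivate_if_inactive`, are assumptions made for the example and are not taken from the SANS controls or from any particular product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

MAX_FAILED_ATTEMPTS = 5                 # assumed policy value
INACTIVITY_LIMIT = timedelta(days=90)   # assumed policy value


@dataclass
class Account:
    name: str
    last_login: datetime
    failed_attempts: int = 0
    active: bool = True


def record_login(account: Account, success: bool, now: datetime) -> None:
    """Update counters after a log-in attempt and lock the account if needed."""
    if not account.active:
        return  # locked or deactivated accounts ignore further attempts
    if success:
        account.failed_attempts = 0
        account.last_login = now
    else:
        account.failed_attempts += 1
        if account.failed_attempts >= MAX_FAILED_ATTEMPTS:
            account.active = False


def deactivate_if_inactive(account: Account, now: datetime) -> None:
    """Deactivate an account that has not logged in within the inactivity limit."""
    if now - account.last_login > INACTIVITY_LIMIT:
        account.active = False


# Example: repeated failures lock the account.
now = datetime(2024, 1, 1)
user = Account("analyst", last_login=now)
for _ in range(MAX_FAILED_ATTEMPTS):
    record_login(user, success=False, now=now)
print(user.active)
```

In practice these rules would be enforced by the cluster's authentication service and every state change would be written to the audit log described earlier.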

Organizations that are serious about big data security should consider these first steps. 
Cyber criminals are never going to stop being on the offensive, and with such a big target to 
protect, it is prudent for any enterprise utilizing big data technologies to be as proactive as 
possible in securing its data. 


New Words

challenge [ˈtʃælɪndʒ] n. challenge
    vt. to challenge
repository [rɪˈpɒzɪtəri] n. store, storehouse, repository
aggregate [ˈæɡrɪɡeɪt] v. to gather, to collect, to total
    n. total, sum, aggregate
    adj. total, collective, combined
extract [ɪkˈstrækt] vt. to extract, to draw out
tailor [ˈteɪlə] vt. to adapt, to fit; to make
enormous [ɪˈnɔːməs] adj. huge, vast
inefficiency [ˌɪnɪˈfɪʃənsi] n. inefficiency, incompetence
consolidated [kənˈsɒlɪdeɪtɪd] adj. organized; unified; reinforced
tempting [ˈtemptɪŋ] adj. tempting, attractive
attacker [əˈtækə] n. attacker
recognition [ˌrekəɡˈnɪʃən] n. recognition, acknowledgement, identification
devastating [ˈdevəsteɪtɪŋ] adj. destructive, devastating
amplify [ˈæmplɪfaɪ] vt. to enlarge, to strengthen
deposit [dɪˈpɒzɪt] vt. to store, to pile up
    vi. to settle (as sediment)
    n. deposit, stored item
dataset [ˈdeɪtəset] n. data set
regulatory [ˈreɡjʊlətəri] adj. regulating, regulatory
adequate [ˈædɪkwət] adj. appropriate, sufficient
workflow [ˈwɜːkfləʊ] n. workflow
adversary [ˈædvəsəri] n. opponent, enemy
configuration [kənˌfɪɡjʊˈreɪʃən] n. structure, configuration
authenticate [ɔːˈθentɪkeɪt] vt. to verify, to authenticate
node [nəʊd] n. node
vulnerability [ˌvʌlnərəˈbɪləti] n. weakness; exposure to attack
straightforward [ˌstreɪtˈfɔːwəd] adj. frank, simple, easy to understand, direct
    adv. frankly
patch [pætʃ] n. patch
    vt. to patch, to repair
automation [ˌɔːtəˈmeɪʃən] n. automatic control, automatic operation
framework [ˈfreɪmwɜːk] n. framework, structure
uniform [ˈjuːnɪfɔːm] adj. unified, identical, consistent
deactivate [ˌdiːˈæktɪveɪt] vt. to disable, to make inactive
inactive [ɪnˈæktɪv] adj. not active, stopped
probability [ˌprɒbəˈbɪləti] n. likelihood, probability
offensive [əˈfensɪv] adj. attacking, unpleasant, insulting
    n. attack, offensive
prudent [ˈpruːdənt] adj. cautious, careful


Phrases

consumer need              what customers want or require
share with                 to share, to give a part to
crown jewels               a company's core or most valuable assets
trade secret               commercial secret
upwards of                 more than
financial institution      bank or similar financial body
government regulation      government control, government rules
come into play             to begin to operate, to take effect
on a case-by-case basis    judging each case on its own merits
data transfer              data transmission
distributed environment    distributed environment
programming tool           tool for writing programs
role-based access control  access control based on a user's role
on the fly                 while running, without stopping
proprietary technology     technology owned by one company
application layer          application layer
permit of                  to allow for
cyber criminal             computer criminal
as proactive as possible   taking the initiative as far as possible


Abbreviations

PII (Personally Identifiable Information)
SANS (SysAdmin, Audit, Network, Security)


Notes


[1] Big data provides an enormous competitive advantage for corporations, helping businesses 
tailor their products to consumer needs, identify and minimize corporate inefficiencies, 
and share data with user groups across the enterprise. 

In this sentence, "helping businesses tailor their products to consumer needs, identify and minimize 
corporate inefficiencies, and share data with user groups across the enterprise" is a present 
participle phrase that gives further information about "an enormous competitive advantage". 

[2] A breach in a big data repository could be even more damaging at a financial institution or 
healthcare provider, where the value of the data is extremely high and government 
regulations come into play. 

In this sentence, "where the value of the data is extremely high and government regulations come 
into play" is a non-restrictive attributive clause modifying "at a financial institution or healthcare 
provider". 

[3] The variety, velocity and volume of big data amplify the security management challenges 
that are addressed in traditional security management. 

In this sentence, "that are addressed in traditional security management" is an attributive clause 
modifying "the security management challenges". 

[4] With a large number of servers, there is an increased possibility that the configuration of 
servers may not be consistent and that certain systems may remain vulnerable. 

In this sentence, "that the configuration of servers may not be consistent" and "that certain systems 
may remain vulnerable" are two coordinate clauses joined by "and". Together they stand in apposition 
to "possibility" and explain it further. 


Exercises


[Ex. 1] Answer the following questions according to the text.

1. What are more and more companies doing as the amount of data being collected continues 
to grow? 

2. What amplifies the security management challenges that are addressed in traditional security 
management? 

3. What may a big data environment include? 

4. What do physical security controls need to be when big data environments are distributed 
geographically? 

5. What is an additional big data security challenge? 

6. What is the advantage of NoSQL? 

7. What are the controls that the author would recommend to address the security challenges 
presented by big data? 

8. What do proprietary technologies like Cloudera Sentry or DataStax Enterprise offer? 

9. What do security engineers in the organization need to be tasked with? 


10. What can monitoring account access do? 


[Ex. 2] Translate the following sentences into Chinese.

1. In this case, you can create a data source that retrieves data from the work items in the 
repository. 

2. This is a quick way to have a form that you can edit and tailor according to your requirements. 

3. An attacker who successfully exploited this vulnerability could run arbitrary code as the 
logged-on user. 

4. We are using this transistor to amplify a telephone signal. 

5. The dataset might also contain another table with order information. 

6. This establishes a workflow between use cases. 

7. Consequently, each registered base node might have different user registries configured if 
security is enabled. 

8. Older machines will need a software patch to be loaded to correct the date. 

9. Administrative staff may be deskilled through increased automation and efficiency. 

10. At this point, you may activate or deactivate whatever other plugins you wish. 


[Ex. 3] Translate the following passage.

Big Data is the current buzzword in the technology sector, but in fields such as security it 
is much more than this. Businesses are starting to bet strongly on the implementation of tools 
based on the collection and analysis of large volumes of data to allow them to detect 
malicious activity. What started out as a fashionable term has turned into a fundamental part 
of how we operate. 

So, what exactly are the advantages of Big Data? Well, have a think about the current 
situation in which the use of mobile devices is growing, the Internet of Things has arrived, the 
number of Internet users is reaching new highs, and quickly you realize that all of this is 
prompting an increase in the number of accesses, transactions, users, and vulnerabilities for 
technology systems. This results in a surge in raw data (on the World Wide Web, on 
databases, or on server logs), which is increasingly more complex and varied, and generated 
rapidly. 

Given these circumstances, we are encouraged to adopt tools that are capable of 
capturing and processing all of this information, helping to visualize its flow and apply 
automatic learning techniques that are capable of discovering patterns and detecting anomalies. 


[Ex. 4] Fill in the blanks with the words given below (use each word only once).


monitoring exposure relieves reputational mobile 


requirements organization storage independently growing 


Key Challenges for Big Data Security 


* Cyber Criminals. As it becomes bigger and more difficult to manage, big data 
consequently becomes more appealing to hackers and cyber criminals. Because big 
data is a dataset of unprecedented size with centralized access, any ___(1)___ is total 
exposure. These types of breaches make headlines, incite consumers, and may cause 
major ___(2)___, legal, and financial damage. 

* Resource Capacity. As an organization collects big data across channels at an 
exponential rate, their ___(3)___ can grow beyond terabytes. As a result, data 
encryption and migration can get bottle-necked or leaky. Additionally, the sheer 
volume of data makes implementation of security control unwieldy. The tools required 
for ___(4)___ and analyzing big data produce massive amounts of their own 
security-related data every day, which puts undue pressure on the organization's 
capacity to store and analyze it all. 

* Cloud and Remote Access. One answer to the capacity issues of big data is to put it in 
the cloud. This ___(5)___ some of the burden for storage and processing, but creates 
new challenges for protecting it from criminals. And as more businesses allow for 
flex-time and ___(6)___ offices, employees have access to sensitive company data via 
smart phones, tablet devices, and home laptops. Protecting personal devices becomes a 
balancing act between security and productivity. 

* Supply Chain and Partner Security. Organizations rarely operate ___(7)___. They rely 
on supply chain partners and external vendors for many of their business functions. 
Information flows in and out of each ___(8)___ to keep these relationships functioning. 
Coordinating the safety of big data across partners is another layer of complexity to a 
business's information security challenges. 

* Privacy. Both private and public organizations face the ___(9)___ challenge of privacy 
concerns. Consumers are wary about personal information being collected and stored, 
and fearful about security breaches. Plus, there are legislative and regulatory ___(10)___ to 
keep in mind. 


Text B 


The Future of Big Data—Big Data 2.0 


For data geeks like myself, it has been a hell of a ride. The rise of big data in marketing 
and media has brought great interest and excitement to people. Finally, the creative directors, 
C-suite, and account leaders are leaning on the data scientists once again to provide deep 
consumer understanding and insights that are backed up and proven by actual consumers. 

Today, clients often ask me about the future of big data and what the next step is; how 
can we leverage data on an even deeper level in order to extract meaningful consumer insights 
that go beyond where we are now? Most of the standard answers are around the ability to get 
data and insights in real time and from more devices than ever. While it is true that the 
connected homes, wearables, and connected cars will allow us to collect a much wider set of 
data points, I believe that this is just an extension of the existing approach. 

It’s time we move beyond structured data and into the prime time of text analytics. 
Here’s why. 


1. Numeric vs. Emotional 


Most of the data points collected today are numerical or binary. They tell us if somebody 
engaged with a site, how well, how long, and where they engaged, but the data fails to tell 
us why. I believe the future of big data—Big Data 2.0 (to coin a term)—is not about more 
binary and numeric data points, but instead about asking the deeper questions. Big Data 2.0 
should be focused not on what and where but on answering why. It should be concerned with 
getting a better understanding of the consumer’s emotional state and the decision logic, and 
thereby provide deeper insight into the consumers’ choices. If we focus on why instead of 
how often, we can create more meaningful, quality connections between consumers and 
brands. In other words, while numbers are great indicators of performance, focusing solely on 
them means brands miss the element of human connection. 

Take Amazon data as an example. Amazon is filled with great numerical indicators. Its 
data can tell us the sales ranks (how many sold relative to category), the customer engagement 
(how many people shared product reviews), and their satisfaction with the product (the 
positive and negative reviews). All of these are great indicators, but they are still very simple 
and only tell a small part of the story. 

Let’s assume we are a consumer packaged goods company and we want to introduce a 
new line of diapers into the market. We decide to look at Amazon in order to better 
understand which products are category leaders (sales rank and number of sales) and how the 
consumers like the product itself (reviews). If we analyze these metrics across all diapers, we 
have a Big Data 1.0 picture that tells us exactly who sells the most and what the audience 
favorite is. 

This is not enough anymore; Big Data 2.0 needs to be about the why: Why is a particular 
product the most sold? Why does it have an average rating of 5? 


2. What’s the Solution? 


For us, the easiest way to get started with Big Data 2.0 is to focus on the unstructured 
data we collect every day. This can be reviews, customer support emails, community forums, 
even your own CRM system. The simplest way to look at this data is through a process called 
text analytics. 

Text analytics is a fairly straightforward process that breaks out like this: 

(1) Acquisition: Collecting and aggregating the raw data you want to analyze 

(2) Transforming & Preprocessing: Cleaning and formatting the data to make it easier to 
read 

(3) Enrichment: Enhancing the data by adding additional data points 

(4) Processing: Performing specific analyses and classifications on the data 

(5) Frequencies & Analysis: Evaluation of the results and translation into numerical 
indicators 

(6) Mining: Actual extraction of information 
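
A few of the steps above can be sketched in code. The example below covers only simplified versions of acquisition, preprocessing and frequency counting; enrichment, classification and mining are left out. The sample reviews, the tiny stop-word list and the function names `preprocess` and `term_frequencies` are invented for illustration.

```python
import re
from collections import Counter

# Acquisition: hypothetical raw reviews standing in for collected data.
RAW_REVIEWS = [
    "Great PRICE and great value!",
    "The tape does not stick.  Poor value.",
    "Special offer, good price.",
]

# A deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "and", "does", "not", "a", "of"}


def preprocess(text: str) -> list[str]:
    """Transforming & preprocessing: lowercase, strip punctuation, drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def term_frequencies(reviews: list[str]) -> Counter:
    """Frequencies & analysis: count how often each term appears across reviews."""
    counts = Counter()
    for review in reviews:
        counts.update(preprocess(review))
    return counts


frequencies = term_frequencies(RAW_REVIEWS)
print(frequencies.most_common(3))
```

Turning free text into term counts like this is what makes unstructured reviews comparable with the numerical indicators discussed earlier.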


3. Real-World Uses 


Here’s a real-world application using our example above. We are trying to understand 
the diaper market. In order to not turn this into a step-by-step guide, let’s assume that we 
already have collected all diaper reviews as well as their qualitative indicators. That means 
we know what sells best and what ranks best/worst. In order to take this to the next level, we 
would start to extract words and phrases from the reviews. This will tell us some of the 
recurring patterns and their frequencies within the reviews. I actually performed this analysis 
by evaluating thousands of reviews and found three very actionable insights we would have 
never gotten to without text analytics. 


3.1 Why did it sell so well? 


When I looked at the reviews of the top-selling product, I found that the most mentioned 
terms across the majority of the helpful reviews were “price,” “special,” and “value.” This 
tells us that people did not buy it because of its quality or features, but because of its pricing. 
So when we are launching our product, we want to look at this one for price/value guidance 
instead of features. 


3.2 Why didn’t people like it? 


This one was very revealing. The brand with the most negative reviews had an extremely 
high frequency around the terms “tape,” “stick,” “stay closed,” and “open.” After a few 
reads, I discovered that consumers had no issues with the usual key features on a diaper such 
as “absorbency,” “leakage,” or “softness,” but actually had issues with the tape on the side 
of the diaper, and the fact that it kept opening. The number of negative reviews overall that 
mentioned these issues leads us to believe that this is a feature that brands don’t talk about but 
consumers care about. 
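
A simple way to surface terms such as “tape” is to compare how often a term appears in negative reviews with how often it appears in all reviews. The following is a rough sketch, assuming the negative reviews are a non-empty subset of the full review set; the `min_lift` threshold, the function names, and the sample reviews are hypothetical.

```python
import re
from collections import Counter


def term_share(reviews):
    """Fraction of reviews that mention each term at least once."""
    counts = Counter()
    for text in reviews:
        counts.update(set(re.findall(r"[a-z]+", text.lower())))
    return {term: n / len(reviews) for term, n in counts.items()}


def overrepresented(negative, everything, min_lift=1.5):
    """Terms whose share in negative reviews is at least min_lift times their overall share.

    Assumes every review in `negative` also appears in `everything`.
    """
    neg, base = term_share(negative), term_share(everything)
    return sorted(t for t in neg if neg[t] / base[t] >= min_lift)


# Hypothetical sample reviews.
negative = ["The diaper tape does not stick", "Diaper tape keeps opening"]
positive = ["Soft diaper", "Absorbent diaper"]

flagged = overrepresented(negative, negative + positive)
print(flagged)
```

Terms that appear everywhere, such as “diaper” in this sample, have a ratio of 1 and drop out, which leaves the complaint-specific vocabulary.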


3.3 Smart filtering 


One interesting issue we came across is that a lot of negative reviews were not actually 
about the product but rather focused on shipping, stock level, and packaging concerns. 
By tagging these reviews and removing them from the set, we are able to evaluate purely at 
the product level and focus on product-related concerns. If we were to list our diaper on 
Amazon, we would recommend adding a shipping and stock level guarantee prominently in the 
copy, a competitive advantage that speaks directly to consumer concerns. 
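
The tagging and removal step can be approximated with a keyword filter. This is a deliberately crude sketch: the keyword list and function names are assumptions made for illustration, and a real project would likely need a richer classifier.

```python
# Assumed keywords that signal a logistics complaint rather than a product one.
LOGISTICS_TERMS = ("shipping", "delivery", "out of stock", "packaging")


def tag_review(text):
    """Tag a review as 'logistics' or 'product' by simple substring matching."""
    lowered = text.lower()
    return "logistics" if any(term in lowered for term in LOGISTICS_TERMS) else "product"


def product_only(reviews):
    """Keep only the reviews tagged as product-related."""
    return [text for text in reviews if tag_review(text) == "product"]


# Hypothetical negative reviews.
negative_reviews = [
    "Shipping took three weeks",
    "The tape keeps opening",
    "Packaging was crushed",
]

print(product_only(negative_reviews))
```

Keeping the tag as a separate step, instead of deleting reviews outright, means the logistics reviews remain available for their own analysis.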


3.4 What do they want? 


From an R&D perspective, this insight is worth gold. By evaluating reviews that have 
terms like “I wish,” “I hope,” or “they should,” we are able to detect common features consumers 
are looking for when thinking about diapers. These are great insights that address the 
constantly changing needs of consumers. We can feed these product feature-specific 
insights to our R&D team as well as our copywriters. 
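
Detecting such wish statements can be sketched with a regular expression that finds the trigger phrase and captures the rest of the sentence. The pattern, the function name, and the sample reviews below are illustrative assumptions.

```python
import re

# Trigger phrase, then everything up to the next sentence-ending punctuation mark.
WISH_PATTERN = re.compile(r"\b(i wish|i hope|they should)\b\s+([^.!?]+)", re.IGNORECASE)


def extract_wishes(reviews):
    """Return (trigger, requested feature) pairs found in the review texts."""
    wishes = []
    for text in reviews:
        for match in WISH_PATTERN.finditer(text):
            wishes.append((match.group(1).lower(), match.group(2).strip()))
    return wishes


# Hypothetical sample reviews.
sample = [
    "I wish the tabs were stronger. Otherwise fine.",
    "They should add a wetness indicator!",
    "Works as expected.",
]

for trigger, feature in extract_wishes(sample):
    print(trigger, "->", feature)
```

The captured fragments can then be grouped or counted to see which requested features recur most often.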

As you can see, when analyzing the diaper category on Amazon alone, Big Data 2.0 
yielded insights beyond binary performance indicators. We could see the crowd favorites but 
did not (yet) know the “why” behind purchases, or understand the positive or negative reviews, 
until our text analytics exercise. There are countless consumer insights to be mined from 
textual, unstructured data that give us the voice of the consumer, their motivations, and a 
deeper understanding of their purchasing behavior. 

I hope the above examples and thoughts give you some good ideas and inspiration 
on how to think about text analytics for your organization and projects. Start looking at your 
existing data: export your CRM, examine the comments on your website or mentions of your 
products in topic forums, even emails from your sales department's inbox. It's Big Data 
2.0 time and that's where you'll find the gold. 


New Words 
geek [gi:k] n. geek; a technology enthusiast 
insight ['insait] n. insight, discernment 
leverage ['li:vəridʒ] n. leverage 
    v. to use to advantage 
wearable ['weərəbl] adj. wearable; able to be worn 
collect [kə'lekt] v. to collect, gather, accumulate 
extension [iks'tenʃən] n. extension, expansion; file extension 
emotional [i'məuʃənəl] adj. emotional; relating to feelings 
binary ['bainəri] adj. binary; base-two 
engaged [in'geidʒd] adj. in use 
indicator ['indikeitə] n. indicator, index; designator 
rank [ræŋk] n. rank, ranking 
    vt. to rank, classify, grade 
positive ['pɔzitiv] adj. positive, affirmative 
negative ['negətiv] adj. negative, unfavorable 
forum ['fɔ:rəm] n. forum 
acquisition [ˌækwi'ziʃən] n. collection, acquisition 
formatting ['fɔ:mætiŋ] n. formatting 
enrichment [in'ritʃmənt] n. enrichment 
frequency ['fri:kwənsi] n. frequency; number of occurrences 
qualitative ['kwɔlitətiv] adj. qualitative 
actionable ['ækʃənəbl] adj. actionable; able to be acted on 
revealing [ri'vi:liŋ] adj. enlightening, illuminating, telling 
absorbency [əb'sɔ:bənsi] n. absorbency 
leakage ['li:kidʒ] n. leakage, leak 
softness ['sɔftnis] n. softness, gentleness 
guarantee [ˌgærən'ti:] n. guarantee, warranty 
    vt. to guarantee 
competitive [kəm'petitiv] adj. competitive 
copywriter ['kɔpiraitə] n. advertising copywriter 
countless ['kauntlis] adj. countless, innumerable 
motivation [ˌməuti'veiʃən] n. motivation, motive 


Phrases 

a hell of a             (used for emphasis) extremely bad, outrageous, unbearable 
C-suite                 C-level staff, i.e. a company's top management, so called because 
                        their English job titles all begin with the letter C 
back up                 to support 
real time               real time 
prime time              prime time; peak viewing hours 
engage with ...         to interact with, become involved with 
coin a term             to create a new word or expression 
be concerned with       to involve; to pay attention to 
emotional state         emotional state 
decision logic          decision logic 
be filled with          to be full of 
customer engagement     customer engagement; customer interaction 
product review          product review 
community forum         community forum 
text analytic           text analytics 
raw data                raw data 
actionable insight      a conclusion that can be acted on 
smart filtering         smart filtering 
competitive advantage   competitive advantage 
sales department        sales department 


Abbreviations 

CRM (Customer Relationship Management) 


Exercises 


[Ex. 5] Answer the following questions according to the text. 
1. What question do clients often ask the author? 
2. What are most of the data points collected today? 
3. What does the author believe? 
4. What should Big Data 2.0 be focused on? 
5. What can we do if we focus on why instead of how often? 
6. What is the simplest way to look at this data? 
7. What is text analytics? 
8. What were the most mentioned terms across the majority of the helpful reviews? 
9. Which terms appeared with extremely high frequency in the reviews of the brand with 
the most negative reviews? 
10. What did Big Data 2.0 yield when the diaper category was analyzed on Amazon alone? 


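Reference Translation 


How to Manage Big Data's Big Security Challenges 


As the volume of data being collected keeps growing, more and more companies are building 
big data repositories to store and aggregate data and extract meaning from it. Big data gives 
enterprises a huge competitive advantage: it helps them tailor products to consumer needs, 
identify and minimize corporate inefficiencies, and share data across user groups within the 
enterprise. In 2017 alone its growth rate reached 58%, and these technologies and their 
benefits will continue to emerge. 

Unfortunately, legitimate organizations are not the only ones growing. Well-organized big 
data sets are extremely tempting to cyber attackers. Breaking into an organization's big data 
repository can bring a criminal group a far bigger payoff. When attackers set their sights on 
large data repositories, the consequences for the affected organization can be devastating. 
The data in these repositories may include the company's core secrets: customer data, 
employee data, and trade secrets. The recent Target Corporation data breach is estimated to 
cost the company as much as US$1.1 billion, while the PlayStation data breach is estimated to 
have cost Sony US$170 million. A breach of a big data repository could do even greater 
damage to a financial institution or healthcare provider, because their data is extremely 
valuable and governments have imposed laws and regulations on it. 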


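1. Data 


The variety, velocity, and volume of big data make traditional security management more 
challenging. Large data repositories store information from many sources across the 
enterprise, and this poses a serious challenge for secure access management. Each data source 
has its own access restrictions and security policies, which makes it especially hard to balance 
security across all sources, particularly when meaningful information must be aggregated and 
extracted from the data. For example, a big data environment might include a data set of 
proprietary research information, a data set subject to regulatory compliance, and a separate 
data set of personally identifiable information (PII). Researchers may want to correlate their 
research with the data set that contains PII, but what restrictions should be in place to ensure 
adequate security? Protecting big data requires balancing security requirements case by case. 

In addition, many repositories collect data at high volume and velocity from many different 
data sources, each of which may have its own data transfer workflow. Connections to multiple 
repositories can enlarge the attack surface available to an adversary. A big data system that 
receives feeds from 20 different data sources may give an attacker 20 viable vectors for trying 
to access the data cluster. 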


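2. Infrastructure 


Another big data challenge is the distributed nature of big data environments. Compared with 
a single high-end database server, a distributed environment is more complex and more 
vulnerable to attack. When a big data environment is spread across different geographic 
locations, standardized physical security controls are needed at every accessible location. 
When data scientists across the organization want access to information, perimeter protection 
becomes both very important and complicated, in order to secure user access and protect the 
system from possible attack. With a large number of servers, server configurations may be 
inconsistent, and some of those systems may be vulnerable. 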


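3. Technology 


A further big data security challenge is the big data programming tools themselves, including 
Hadoop and NoSQL databases, which were not originally designed with security in mind. For 
example, Hadoop originally did not authenticate services or users and did not encrypt data 
transmitted between nodes in the environment. This creates authentication and network 
security vulnerabilities. NoSQL databases lack some of the security features that traditional 
databases provide, such as role-based access control. The advantage of NoSQL is its flexibility 
in accommodating new data types, but these technologies do not define security policies for 
that new data. 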
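
To make the idea of role-based access control concrete, here is a minimal sketch of the check such a feature performs: users hold roles, roles hold permissions, and a request is allowed only if one of the user's roles grants it. The role names, data set names, and users are hypothetical, and this is not the API of Hadoop or of any NoSQL product.

```python
# Hypothetical roles mapped to (data set, action) permissions.
ROLE_PERMISSIONS = {
    "analyst": {("research", "read")},
    "admin": {("research", "read"), ("research", "write"), ("pii", "read")},
}

# Hypothetical users mapped to the roles they hold.
USER_ROLES = {
    "alice": {"analyst"},
    "bob": {"admin"},
}


def is_allowed(user, dataset, action):
    """Return True if any role held by the user grants the action on the data set."""
    return any(
        (dataset, action) in ROLE_PERMISSIONS.get(role, set())
        for role in USER_ROLES.get(user, set())
    )


print(is_allowed("alice", "research", "read"))  # analyst role grants this
print(is_allowed("alice", "pii", "read"))       # no role held by alice grants this
```

Unknown users and unknown roles fall through to an empty permission set, so the check denies by default.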


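4. Securing Big Data 


How can the security of traditional database management be brought to big data? Several 
organizations have described and defined different security controls. The SANS Institute 
provides 20 security controls. Listed below are several controls that address the security 
challenges big data presents. 
● Application software security. Use secure versions of open source software. As noted 
above, big data technologies were not originally designed with security in mind. Open 
source technologies such as Apache Accumulo or the .20.20x version of Hadoop or later 
can address this challenge. In addition, proprietary technologies such as Cloudera Sentry 
or DataStax Enterprise provide enhanced security at the application layer. Specifically, 
Sentry and Accumulo also support role-based access control to strengthen the security of 
NoSQL databases. 
● Maintenance, monitoring, and analysis of audit logs. Implement audit logging 
technologies to understand and monitor big data clusters. Technologies such as Apache 
Oozie can help implement this capability. Keep in mind that security engineers in the 
organization need to be responsible for examining and monitoring these files. It is very 
important that logs are audited, maintained, and analyzed consistently across the 
enterprise. 
● Secure configurations for hardware and software. Build servers from secure images for 
all systems in the organization's big data architecture. Make sure patches are applied 
promptly on these machines and that only a small number of users have administrative 
privileges. Use an automation framework such as Puppet to automate system 
configuration and ensure that all big data servers in the enterprise are uniform and secure. 
● Account monitoring. Manage the accounts of big data users. Require strong passwords, 
deactivate inactive accounts, and enforce a maximum number of permitted failed login 
attempts to help block attacks on the cluster. It is especially important to remember that 
the enemy is not always outside the organization. Monitoring account access can reduce 
the likelihood of an insider threat. 
Organizations concerned with big data security should consider these controls. Cyber 
criminals will never stop attacking, and with so large a target to protect, any enterprise that 
uses big data technologies should secure its data as fully as possible. 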
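
The account monitoring control (locking out after too many failed logins and flagging inactive accounts) can be sketched as follows. The thresholds, the class name, and its methods are assumptions chosen for illustration; a real cluster would rely on its own identity management system.

```python
from datetime import datetime, timedelta

MAX_FAILED_LOGINS = 5                  # assumed lockout threshold
INACTIVITY_LIMIT = timedelta(days=90)  # assumed inactivity window


class Account:
    """Tracks failed logins and last activity for one user account."""

    def __init__(self, name, last_login):
        self.name = name
        self.last_login = last_login
        self.failed_logins = 0
        self.locked = False

    def record_login(self, success, now):
        """Record a login attempt; return True only if access is granted."""
        if self.locked:
            return False
        if success:
            self.failed_logins = 0
            self.last_login = now
            return True
        self.failed_logins += 1
        if self.failed_logins >= MAX_FAILED_LOGINS:
            self.locked = True
        return False

    def is_inactive(self, now):
        """True if the account has not logged in within the inactivity window."""
        return now - self.last_login > INACTIVITY_LIMIT


now = datetime(2024, 1, 1)
account = Account("analyst1", last_login=now - timedelta(days=120))
print(account.is_inactive(now))

for _ in range(MAX_FAILED_LOGINS):
    account.record_login(success=False, now=now)
print(account.locked)
```

Once locked, the account rejects even a correct login until an administrator resets it, which is the behavior a failed-login limit is meant to enforce.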


